Re: [Pacemaker] Suggestions for managing HA of containers from within a Pacemaker container?
- Original Message -

Pacemaker as a scheduler in Mesos or Kubernetes does sound like a very interesting idea. Packaging corosync into super-privileged containers still doesn't make much sense to me. What's the reason for isolating something and then giving it all permissions on a host machine?

Because soon everything will run in containers. Take a look at RHEL Atomic and the stuff CoreOS is doing. The only way pacemaker will exist on those distributions is if it lives in a super-privileged container.

On Mon, Feb 23, 2015 at 5:20 PM, Andrew Beekhof and...@beekhof.net wrote:

On 10 Feb 2015, at 1:45 pm, Serge Dubrouski serge...@gmail.com wrote:

Hello Steve, are you sure that Pacemaker is the right product for your project? Have you checked Mesos/Marathon or Kubernetes? Those are frameworks being developed for managing containers.

And in a few years they'll work out that they need some HA features and try to retrofit them :-) In the meantime pacemaker is actually rather good at managing containers already and knows a thing or two about HA and how to bring up a complex stack of services. The one thing that would be really interesting in this area is using our policy engine as the Kubernetes scheduler. So many ideas and so little time.

On Sat Feb 07 2015 at 1:19:15 PM Steven Dake (stdake) std...@cisco.com wrote:

Hi, I am working on containerizing OpenStack in the Kolla project ( http://launchpad.net/kolla ). One of the key things we want to do over the next few months is add H/A support to our container tech. David Vossel had suggested using systemctl to monitor the containers themselves by running health-checking scripts within the containers. That idea is sound.

There is another technology called "super-privileged containers". Essentially it allows more host access for the container, allowing Pacemaker to be treated as a container rather than an RPM or DEB file. I'd like corosync to run in a separate container. These containers will communicate using their normal mechanisms in a super-privileged mode. We will implement this in Kolla.

Where I am stuck is how Pacemaker within a container controls other containers in the host OS. One way I have considered is using the docker --pid=host flag, allowing pacemaker to communicate directly with the host's systemd. Where I am also stuck is that our containers don't run via systemctl, but instead via shell scripts that are executed by third-party deployment software.

An example: let's say a rabbitmq container wants to run. The user would run:

    kolla-mgr deploy messaging

This would run a small bit of code to launch the docker container set for messaging. Could pacemaker run something like:

    kolla-mgr status messaging

to control the lifecycle of the processes? Or would we be better off with some systemd integration with kolla-mgr?
Thoughts welcome. Regards, -steve

-- Serge Dubrouski.
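For the "super-privileged container" approach discussed above, a rough sketch of how such a cluster-stack container is typically launched follows; the image name and the exact set of volume mounts are hypothetical and will depend on how the image is built:

    # run the cluster stack as a "super-privileged" container:
    # host network, host PID namespace, and extended privileges
    docker run -d --name pacemaker \
        --net=host --pid=host --privileged \
        -v /dev/shm:/dev/shm \
        -v /etc/corosync:/etc/corosync \
        -v /var/lib/pacemaker:/var/lib/pacemaker \
        my-pacemaker-image    # hypothetical image name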
Re: [Pacemaker] Suggestions for managing HA of containers from within a Pacemaker container?
- Original Message -

Hi,

Hey Steve, good to see you around :)

I am working on containerizing OpenStack in the Kolla project ( http://launchpad.net/kolla ). One of the key things we want to do over the next few months is add H/A support to our container tech. David Vossel had suggested using systemctl to monitor the containers themselves by running health-checking scripts within the containers. That idea is sound.

Knowing what I know about OpenStack HA now, that is a bad choice.

There is another technology called "super-privileged containers". Essentially it allows more host access for the container, allowing the treatment of

Yep, this is the way to do it. My plan is to have pacemaker running in a container, and have pacemaker capable of launching resources within containers. We already have a Docker resource agent. You can find it here:
https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/docker

Using that agent, pacemaker can launch a docker container and then monitor it by performing health checks within the container. Here's an example of how I'm using this technique to manage a containerized apache instance:
https://github.com/davidvossel/phd/blob/master/scenarios/docker-apache-ap.scenario#L96

Pacemaker as a container rather than an RPM or DEB file. I'd like corosync to run in a separate container. These containers will communicate using their

I actually already got pacemaker+corosync running in a container for testing purposes. If you're interested you can check out some of that work here, https://github.com/davidvossel/phd/tree/master/lib/docker . The phd_docker_utils.sh file holds most of the interesting parts.

normal mechanisms in a super-privileged mode. We will implement this in Kolla. Where I am stuck is how Pacemaker within a container controls other containers in the host OS. One way I have considered is using the docker --pid=host flag, allowing pacemaker to communicate directly with the host's systemd. Where I am also stuck is that our containers don't run via systemctl, but instead via shell scripts that are executed by third-party deployment software. An example: let's say a rabbitmq container wants to run. The user would run kolla-mgr deploy messaging

yes, and from there kolla-mgr hands the containers off to pacemaker to manage. kolla is the orchestration, pacemaker is the scheduler for performing those tasks.

This would run a small bit of code to launch the docker container set for messaging. Could pacemaker run something like kolla-mgr status messaging to control the lifecycle of the processes? Or would we be better off with some systemd integration with kolla-mgr?

Thoughts welcome. Regards, -steve
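As a concrete illustration of the Docker resource agent mentioned above, a resource definition might look roughly like the following (pcs syntax); the image name, container name and monitor command are placeholders, and the exact parameter names supported should be checked against the agent's metadata for your resource-agents version:

    # hedged example: a pacemaker-managed docker container
    pcs resource create web-container ocf:heartbeat:docker \
        image=my-apache-image \
        name=web-container \
        monitor_cmd="/usr/bin/curl -s http://localhost/" \
        op monitor interval=30s timeout=30s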
Re: [Pacemaker] please help
- Original Message -

Pacemaker is only running on one node. Before, it was running on both nodes.

Run "service pacemaker start" on the ams2 node.

Thank you. Best Regards, Perminus, IT
Re: [Pacemaker] Gracefully failing reload operation
- Original Message -

Hi, is there a way for a resource agent to tell pacemaker that in some cases a reload operation is insufficient to apply a new resource definition and a restart is required? I tried returning OCF_ERR_GENERIC, but that prevents the resource from being started until failure-timeout lapses and the cluster is rechecked.

I believe if the resource instance attribute that is being updated is marked as 'unique' by that resource's metadata, pacemaker will force a stop/start instead of allowing the reload.

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#idm254549695664

-- David

Best, Vladislav
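To illustrate the 'unique' marker mentioned above, a resource agent's metadata could declare a parameter like this hypothetical one; changing a parameter flagged unique="1" then causes a full stop/start rather than a reload:

    <parameter name="config_file" unique="1" required="1">
      <longdesc lang="en">Path to the service configuration file.</longdesc>
      <shortdesc lang="en">Config file</shortdesc>
      <content type="string" default="/etc/myservice.conf"/>
    </parameter>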
Re: [Pacemaker] Version of libqb is too old: v0.13 or greater requried
- Original Message -

Hi Everybody,

I have compiled libqb 0.17.1 under Debian Jessie/testing amd64 as:

tar zxvf libqb-v0.17.1.tar.gz
cd libqb-0.17.1/
./autogen.sh
./configure
make -j8
make -j8 install

Then, after successful builds of COROSYNC 2.3.4, CLUSTER-GLUE 1.0.12 and RESOURCE-AGENTS 3.9.5, compiling PACEMAKER 1.1.12 fails with:

unzip Pacemaker-1.1.12.zip
cd pacemaker-Pacemaker-1.1.12/
addgroup --system haclient
./autogen.sh
./configure
[...]
configure: error: in `/home/alexis/pacemaker-Pacemaker-1.1.12':
configure: error: Version of libqb is too old: v0.13 or greater requried

I have tried to pass some flags to ./configure, but I still get this error.

What do you get when you run:

pkg-config --modversion libqb

Also, make sure you don't have an old version of libqb installed that you did a make install over the top of.

-- Vossel

What am I doing wrong? Thanks for your help, -- Alexis de BRUYN alexis.mailingl...@de-bruyn.fr
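A minimal way to spot a stale, previously installed copy of libqb shadowing the new one (paths are typical defaults and may differ on your system):

    # version pkg-config actually resolves
    pkg-config --modversion libqb
    # look for duplicate copies of the .pc file and shared libraries,
    # e.g. in /usr/lib, /usr/lib/x86_64-linux-gnu and /usr/local/lib
    find /usr -name 'libqb.pc' -o -name 'libqb.so*' 2>/dev/null
    ldconfig -p | grep libqb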
Re: [Pacemaker] pacemaker-remote not listening
- Original Message -

Hi, my OS is debian-wheezy. I compiled and installed pacemaker-remote. Startup log:

Jan 27 16:04:30 [2859] vm1 pacemaker_remoted: info: crm_log_init: Changed active directory to /var/lib/heartbeat/cores/root
Jan 27 16:04:30 [2859] vm1 pacemaker_remoted: info: qb_ipcs_us_publish: server name: lrmd
Jan 27 16:04:30 [2859] vm1 pacemaker_remoted: info: main: Starting

My problem is that pacemaker_remote is not listening on port 3121.

By default pacemaker_remote should listen on 3121. This is odd. One thing I can think of: take a look at /etc/sysconfig/pacemaker on the node running pacemaker_remote. Make sure there isn't a custom port set using the PCMK_remote_port variable.

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Remote/index.html#_pacemaker_and_pacemaker_remote_options

-- Vossel

netstat -tulpen | grep 3121

netstat -alpen
Proto RefCnt Flags Type State I-Node PID/Program name Path
unix 2 [ ACC ] STREAM LISTENING 6635 2859/pacemaker_remo @lrmd
unix 2 [ ] DGRAM 6634 2859/pacemaker_remo
...
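A quick way to check whether a custom port is being picked up; note that on Debian the environment file is usually /etc/default/pacemaker rather than /etc/sysconfig/pacemaker, so checking both is a reasonable guess here:

    grep -H PCMK_remote_port /etc/sysconfig/pacemaker /etc/default/pacemaker 2>/dev/null
    # once pacemaker_remoted is running, confirm the TCP listener
    netstat -tlnp | grep 3121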
Re: [Pacemaker] Pacemaker memory usage
- Original Message -

Hi,

We are trying to introduce clustering to our embedded environment using pacemaker and corosync. Our environment includes the following packages and modules for the initial test:

1. Pacemaker - version 1.1.10
2. Corosync - version 2.3.4
3. corosync.conf file - attached to this email thread
4. 2 nodes in the cluster (master and slave)
5. Test app1 that publishes some sensor data (only the master node publishes; the slave node just receives the data)
6. Test app2 that receives the published data and outputs it on the screen (only the master node outputs the data on the screen)

Our test has been successful and we are able to create a cluster with 2 nodes; everything seems to be working fine. However, we observed that pacemaker is consuming approx. 80MB of RAM when both test applications are alive.

80M is more than I would expect. One thing I know you can do is lower the IPC buffer size. That can be done in /etc/sysconfig/pacemaker. Set PCMK_ipc_buffer to something smaller than whatever it defaults to.

-- Vossel

We would like to know if there are any configuration settings or fine tuning that we need to perform to reduce the memory usage. It is very critical to reduce the memory consumption as we are running in an embedded environment.

Thanks, Santosh Bidaralli
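A sketch of the tuning suggested above; the value shown is only an example, and setting the buffer too small will make pacemaker log warnings about oversized/truncated IPC messages, so lower it gradually and watch the logs:

    # /etc/sysconfig/pacemaker
    # maximum size (bytes) of the per-connection IPC buffers
    PCMK_ipc_buffer=65536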
Re: [Pacemaker] Avoid one node from being a target for resources migration
- Original Message -

Hello. I have a 3-node cluster managed by corosync+pacemaker+crm. Node1 and Node2 are DRBD master-slave; they also have a number of other services installed (postgresql, nginx, ...). Node3 is just a corosync node (for quorum): no DRBD/postgresql/... is installed on it, only corosync+pacemaker.

But when I add resources to the cluster, a part of them are somehow moved to node3 and then fail. Note that I have a colocation directive to place these resources on the DRBD master only, and a location constraint with -inf for node3, but this does not help - why? How do I make pacemaker not run anything on node3?

All the resources are added in a single transaction:

cat config.txt | crm -w -f- configure

where config.txt contains the directives and a commit statement at the end. Below are the crm status (error messages) and crm configure show outputs.

root@node3:~# crm status
Current DC: node2 (1017525950) - partition with quorum
3 Nodes configured
6 Resources configured

Online: [ node1 node2 node3 ]

Master/Slave Set: ms_drbd [drbd]
    Masters: [ node1 ]
    Slaves: [ node2 ]
Resource Group: server
    fs (ocf::heartbeat:Filesystem): Started node1
    postgresql (lsb:postgresql): Started node3 FAILED
    bind9 (lsb:bind9): Started node3 FAILED
    nginx (lsb:nginx): Started node3 (unmanaged) FAILED

Failed actions:
    drbd_monitor_0 (node=node3, call=744, rc=5, status=complete, last-rc-change=Mon Jan 12 11:16:43 2015, queued=2ms, exec=0ms): not installed
    postgresql_monitor_0 (node=node3, call=753, rc=1, status=complete, last-rc-change=Mon Jan 12 11:16:43 2015, queued=8ms, exec=0ms): unknown error
    bind9_monitor_0 (node=node3, call=757, rc=1, status=complete, last-rc-change=Mon Jan 12 11:16:43 2015, queued=11ms, exec=0ms): unknown error
    nginx_stop_0 (node=node3, call=767, rc=5, status=complete, last-rc-change=Mon Jan 12 11:16:44 2015, queued=1ms, exec=0ms): not installed

Here's what is going on. Even when you say "never run this resource on node3", pacemaker is going to probe for the resource on node3 regardless, just to verify the resource isn't running. The "monitor_0" failures you are seeing indicate that pacemaker was unable to verify whether the resources are running on node3, because the related packages for those resources are not installed. Given pacemaker's default behavior, I'd expect this.

You have two options.

1. Install the resource-related packages on node3 even though you never want them to run there. This allows the resource agents to verify that the resources are in fact inactive.

2. If you are using the current master branch of pacemaker, there's a new location constraint option called 'resource-discovery=always|never|exclusive'. If you add 'resource-discovery=never' to the location constraint that keeps resources off node3, pacemaker will also avoid performing the 'monitor_0' actions on node3 (see the sketch after the configuration below).
-- Vossel

root@node3:~# crm configure show | cat
node $id=1017525950 node2
node $id=13071578 node3
node $id=1760315215 node1
primitive drbd ocf:linbit:drbd \
    params drbd_resource=vlv \
    op start interval=0 timeout=240 \
    op stop interval=0 timeout=120
primitive fs ocf:heartbeat:Filesystem \
    params device=/dev/drbd0 directory=/var/lib/vlv.drbd/root options=noatime,nodiratime fstype=xfs \
    op start interval=0 timeout=300 \
    op stop interval=0 timeout=300
primitive postgresql lsb:postgresql \
    op monitor interval=10 timeout=60 \
    op start interval=0 timeout=60 \
    op stop interval=0 timeout=60
primitive bind9 lsb:bind9 \
    op monitor interval=10 timeout=60 \
    op start interval=0 timeout=60 \
    op stop interval=0 timeout=60
primitive nginx lsb:nginx \
    op monitor interval=10 timeout=60 \
    op start interval=0 timeout=60 \
    op stop interval=0 timeout=60
group server fs postgresql bind9 nginx
ms ms_drbd drbd meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
location loc_server server rule $id=loc_server-rule -inf: #uname eq node3
colocation col_server inf: server ms_drbd:Master
order ord_server inf: ms_drbd:promote server:start
property $id=cib-bootstrap-options \
    stonith-enabled=false \
    last-lrm-refresh=1421079189 \
    maintenance-mode=false
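For option 2 above, a constraint equivalent to loc_server could be expressed roughly like this in the CIB XML; identifiers are taken from the configuration above, and the exact attribute syntax may differ slightly between builds since the feature was new at the time:

    <rsc_location id="loc_server" rsc="server" node="node3"
                  score="-INFINITY" resource-discovery="never"/>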
Re: [Pacemaker] Some questions on the current state
- Original Message -

Hi Trevor, thank you for answering so fast.

2) Besides the fact that rpm packages are available, do you know how to make rpm packages from the git repository?

./autogen.sh
./configure
make rpm

That will generate rpms from the source tree.

4) Is RHEL 7.x using corosync 2.x and the pacemaker plugin for cluster membership?

No. RHEL 7.x uses corosync 2.x and the new corosync votequorum API. The plugins are a thing of the past for rhel7.

Best regards, Andreas Mock

-----Original Message-----
From: Trevor Hemsley [mailto:thems...@voiceflex.com]
Sent: Monday, 12 January 2015 16:42
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Some questions on the current state

On 12/01/15 15:09, Andreas Mock wrote:

Hi all, almost always when I'm forced to do some major upgrades to our core machines in terms of hardware and/or software (OS), I'm forced to have a look at the current state of pacemaker-based HA. Things are going on and things change. Projects converge and diverge, tools/toolchains come and go, and distributions' marketing strategies change. Therefore I want to ask the following questions in the hope that list members deeply involved can answer them easily.

1) Are there pacemaker packages for RHEL 6.6 and clones? If yes, where?

In the CentOS (etc) base/updates repos. For RHEL they're in the HA channel.

2) How can I create a pacemaker package 1.1.12 on my own from the git sources?

It's already in base/updates.

3) How can I get the current versions of pcs and/or crmsh? Is pcs competitive with crmsh meanwhile?

pcs is in el6.6 and now includes pcsd. You can get crmsh from an openSUSE build repo for el6.

4) Is the pacemaker HA solution of RHEL 7.x still bound to the use of cman?

No

5) Where can I find a current workable version of the agents for RHEL 6.6 (and clones) and RHEL 7.x?

Probably you want the resource-agents package.

T
Re: [Pacemaker] pacemaker-remote debian wheezy
- Original Message -

Hi, what is the best way to install the pacemaker-remote package in a debian wheezy VM? This package is not available in the debian repository.

I have no clue. I just want to point out that if your host OS is debian wheezy and the pacemaker-remote package is in fact unavailable, it is possible the version of pacemaker shipped with wheezy doesn't even have the capability of managing pacemaker_remote nodes.

-- Vossel

Thanks! Regards, Thomas
Re: [Pacemaker] pacemaker error after a couple week or month (David Vossel)
- Original Message -

Hello David, I think I use the latest version from Ubuntu, it is version 1.1.10. Do you think it has a bug in it?

There have been a number of fixes to the lrmd since v1.1.10. It is possible a couple of them could result in crashes. Again, without a backtrace from the lrmd core dump, it is difficult for me to advise whether or not your specific issue has been fixed. Building from source could yield better results for you. The pacemaker master branch is stable at the moment.

lrmd related changes since 1.1.10:

# git log --oneline Pacemaker-1.1.10^..HEAD | grep -e lrmd:
71b429c Low: lrmd: fix regression test LSBdummy install
fb94901 Test: lrmd: Ensure the lsb script is executable
30d978e Low: lrmd: systemd stress tests
568e41d Fix: lrmd: Prevent glib assert triggered by timers being removed from mainloop more than once
977de97 High: lrmd: cancel pending async connection during disconnect
d2d0cba Low: lrmd: ensures systemd python package is available when systemd tests run
f0fe737 Fix: lrmd: fix rescheduling of systemd monitor op during start
c0e8e6a Low: lrmd: prevent \n from being printed in exit reason output
2342835 High: lrmd: pass exit reason prefix to ocf scripts as env variable
412631c High: lrmd: store failed operation exit reason in cib
ad083a8 Fix: lrmd: Log with the correct personality
718bf5b Test: lrmd: Update the systemd agent to test long running actions
c78b4b8 Fix: lrmd: Handle systemd reporting 'done' before a resource is actually stopped
3bd6c30 Fix: lrmd: Handle systemd reporting 'done' before a resource is actually stopped
574fc49 Fix: lrmd: Prevent OCF agents from logging to random files due to value of setenv() being NULL
155c6aa Low: lrmd: wider use of defined literals
fa8bd56 Fix: lrmd: Expose logging variables expected by OCF agents
d9cc751 Fix: lrmd: Provide stderr output from agents if available, otherwise fall back to stdout
3adc781 Low: lrmd: clean up the agent's entire process group
348bb51 Fix: lrmd: Cancel recurring operations before stop action is executed
fa2954e Low: lrmd: Warning msg to indicate duplicate op merge has occurred
b94d0e9 Low: lrmd: recurring op merger regression tests
c29ab27 High: lrmd: Merge duplicate recurring monitor operations
c1a326d Test: lrmd: Bump the lrmd test timeouts to avoid transient travis failures
deead39 Low: lrmd: Install ping agent during lrmd regression test.
aad79e2 Low: lrmd: Make ocf dummy agents executable with regression test in src tree
5c8c7a5 Test: lrmd: Kill uninstalled daemons by the correct name
8e90200 Test: lrmd: Fix upstart metadata test and install required OCF agents
bbdd6e1 Test: lrmd: Allow regression tests to run from the source tree
87f4091 Low: lrmd: Send event alerting estabilished clients that a new client connection is created.
644752e Fix: lrmd: Correctly calculate metadata for the 'service' class
ea7991f Fix: lrmd: Do not interrogate NULL replies from the server
1c14b9d Fix: lrmd: Correctly cancel monitor actions for lsb/systemd/service resources on cleaning up
eca Doc: lrmd: Indicate which function recieves the proxied command
ad4056f Test: lrmd: Drop the default verbosity for lrmd regression tests
eb40d6a Fix: lrmd: Do not overwrite any existing operation status error

-- Vossel

Should I compile from the source?
Best Regards, Ariee On Fri, Dec 19, 2014 at 8:27 PM, pacemaker-requ...@oss.clusterlabs.org wrote: Message: 2 Date: Fri, 19 Dec 2014 14:21:59 -0500 (EST) From: David Vossel dvos...@redhat.com To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Subject: Re: [Pacemaker] pacemaker error after a couple week or month Message-ID: 102420175.739708.1419016919246.javamail.zim...@redhat.com Content-Type: text/plain; charset=utf-8 - Original Message - Hello, I have 2 active-passive fail over system with corosync and drbd. One system using 2 debian server and the other using 2 ubuntu server. The debian servers are for web server fail over and the ubuntu servers are for database server fail over. I applied the same configuration in the pacemaker. Everything works fine, fail over can be done nicely and also the file system synchronization, but in the ubuntu server, it was always has error after a couple week or month. The pacemaker in ubuntu1 had different status with ubuntu2, ubuntu1 assumed that ubuntu2 was down and ubuntu2 assumed that something happened with ubuntu1 but still alive and took over the resources. It made the drbd resource cannot be taken over, thus no fail over happened and we must manually restart the server because restarting pacemaker and corosync didn't help. I have changed the configuration of pacemaker a couple time, but the problem still exist. has anyone experienced it? I use Ubuntu 14.04.1 LTS. I got this error in apport.log ERROR: apport (pid 20361) Fri Dec 19 02:43:52 2014: executable: /usr/lib/pacemaker/lrmd
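Since the advice above hinges on getting a backtrace from the lrmd core dump, a rough sketch of extracting one from the apport crash report on Ubuntu follows; the paths mirror the apport.log excerpt quoted in this thread, and installing matching pacemaker debug symbols first (if available) will make the trace far more useful:

    # unpack the apport crash report and pull out the core dump
    apport-unpack /var/crash/_usr_lib_pacemaker_lrmd.0.crash /tmp/lrmd-crash
    # generate a full backtrace from the core
    gdb /usr/lib/pacemaker/lrmd /tmp/lrmd-crash/CoreDump \
        -batch -ex 'thread apply all bt full' > /tmp/lrmd-backtrace.txt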
Re: [Pacemaker] [Patch]Memory leak of Pacemakerd.
- Original Message -

Hi All,

Whenever a node in the cluster repeatedly starts and stops, pacemakerd on the node that stays up leaks memory. I attached a patch.

this patch looks correct. Can you create a pull request to our master branch?

https://github.com/ClusterLabs/pacemaker

-- Vossel

Best Regards, Hideo Yamauchi.
Re: [Pacemaker] [libqb]Unlink of files bound to sockets
- Original Message -

I used the current trunk. I could not find the unlink calls. If domain sockets are used, these two methods are used.

In ./lib/ipc_socket.c:

static void
qb_ipcc_us_disconnect(struct qb_ipcc_connection *c)
{
    munmap(c->request.u.us.shared_data, SHM_CONTROL_SIZE);
    unlink(c->request.u.us.shared_file_name);

right, so here we're doing the unlinking of the shared file. There's some trick we're using to only have a single file created for all 3 sockets. Is this not working for solaris?

    qb_ipcc_us_sock_close(c->event.u.us.sock);
    qb_ipcc_us_sock_close(c->request.u.us.sock);
    qb_ipcc_us_sock_close(c->setup.u.us.sock);
}

In ./lib/ipc_setup.c:

void
qb_ipcc_us_sock_close(int32_t sock)
{
    shutdown(sock, SHUT_RDWR);
    close(sock);
}

I added the unlink calls in the latter.

-----Original Message-----
From: David Vossel [mailto:dvos...@redhat.com]
Sent: Thursday, 18 December 2014 18:13
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] [libqb]Unlink of files bound to sockets

- Original Message -

I sent this email yesterday to the mailing list of libqb, 'quarterback-de...@lists.fedorahosted.org', but there is nearly no activity there since August.

i saw the email. i flagged it so it would get a response.

I use the current trunk of libqb. In qb_ipcc_us_sock_close and qb_ipcs_us_withdraw of lib/ipc_setup.c, sockets are closed. Is there a reason why the files bound to the sockets are not deleted with unlink? Is unlinking not necessary with Linux?

Unlinking is required for linux. For client/server connections, qb_ipcc_us_disconnect unlinks on the client side. qb_ipcs_us_disconnect unlinks on the server side.

I found thousands of files in statedir=/var/corosync/run after a while.

What version of corosync are you using? There were some reference leaks for ipc connections in the corosync code we fixed a year or so ago that should have fixed this.

-- David

I tried this and it seems to work without errors, e.g.:

void
qb_ipcc_us_sock_close(int32_t sock)
{
#ifdef QB_SOLARIS
    struct sockaddr_un un_addr;
    socklen_t un_addr_len = sizeof(struct sockaddr_un);
#endif
    shutdown(sock, SHUT_RDWR);
#ifdef QB_SOLARIS
    if (getsockname(sock, (struct sockaddr *)&un_addr, &un_addr_len) == 0) {
        if (strstr(un_addr.sun_path, "-") != NULL) {
            qb_util_log(LOG_DEBUG, "un_addr.sun_path=%s", un_addr.sun_path);
            unlink(un_addr.sun_path);
        }
    } else {
        qb_util_log(LOG_DEBUG, "getsockname returned errno=%d", errno);
    }
#endif
    close(sock);
}

Regards, Andreas
Re: [Pacemaker] pacemaker error after a couple week or month
- Original Message - Hello, I have 2 active-passive fail over system with corosync and drbd. One system using 2 debian server and the other using 2 ubuntu server. The debian servers are for web server fail over and the ubuntu servers are for database server fail over. I applied the same configuration in the pacemaker. Everything works fine, fail over can be done nicely and also the file system synchronization, but in the ubuntu server, it was always has error after a couple week or month. The pacemaker in ubuntu1 had different status with ubuntu2, ubuntu1 assumed that ubuntu2 was down and ubuntu2 assumed that something happened with ubuntu1 but still alive and took over the resources. It made the drbd resource cannot be taken over, thus no fail over happened and we must manually restart the server because restarting pacemaker and corosync didn't help. I have changed the configuration of pacemaker a couple time, but the problem still exist. has anyone experienced it? I use Ubuntu 14.04.1 LTS. I got this error in apport.log ERROR: apport (pid 20361) Fri Dec 19 02:43:52 2014: executable: /usr/lib/pacemaker/lrmd (command line /usr/lib/pacemaker/lrmd) wow, it looks like the lrmd is crashing on you. I haven't seen this occur in the wild before. Without a backtrace it will be nearly impossible to determine what is happening. Do you have the ability to upgrade pacemaker to a newer version? -- Vossel ERROR: apport (pid 20361) Fri Dec 19 02:43:52 2014: is_closing_session(): no DBUS_SESSION_BUS_ADDRESS in environment ERROR: apport (pid 20361) Fri Dec 19 02:43:52 2014: wrote report /var/crash/_usr_lib_pacemaker_lrmd.0.crash my pacemaker configuration: node $id=1 db \ attributes standby=off node $id=2 db2 \ attributes standby=off primitive ClusterIP ocf:heartbeat:IPaddr2 \ params ip=192.168.0.100 cidr_netmask=24 \ op monitor interval=30s primitive DBase ocf:heartbeat:mysql \ meta target-role=Started \ op start timeout=120s interval=0 \ op stop timeout=120s interval=0 \ op monitor interval=20s timeout=30s primitive DbFS ocf:heartbeat:Filesystem \ params device=/dev/drbd0 directory=/sync fstype=ext4 \ op start timeout=60s interval=0 \ op stop timeout=180s interval=0 \ op monitor interval=60s timeout=60s primitive Links lsb:drbdlinks primitive r0 ocf:linbit:drbd \ params drbd_resource=r0 \ op monitor interval=29s role=Master \ op start timeout=240s interval=0 \ op stop timeout=180s interval=0 \ op promote timeout=180s interval=0 \ op demote timeout=180s interval=0 \ op monitor interval=30s role=Slave group DbServer ClusterIP DbFS Links DBase ms ms_r0 r0 \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Master location prefer-db DbServer 50: db colocation DbServer-with-ms_ro inf: DbServer ms_r0:Master order DbServer-after-ms_ro inf: ms_r0:promote DbServer:start property $id=cib-bootstrap-options \ dc-version=1.1.10-42f2063 \ cluster-infrastructure=corosync \ expected-quorum-votes=2 \ stonith-enabled=false \ no-quorum-policy=ignore \ last-lrm-refresh=1363370585 my corosync config: totem { version: 2 token: 3000 token_retransmits_before_loss_const: 10 join: 60 consensus: 3600 vsftype: none max_messages: 20 clear_node_high_bit: yes secauth: off threads: 0 rrp_mode: none transport: udpu cluster_name: Dbcluster } nodelist { node { ring0_addr: db nodeid: 1 } node { ring0_addr: db2 nodeid: 2 } } quorum { provider: corosync_votequorum } amf { mode: disabled } service { ver: 0 name: pacemaker } aisexec { user: root group: root } logging { fileline: off to_stderr: yes 
to_logfile: yes logfile: /var/log/corosync/corosync.log to_syslog: no syslog_facility: daemon debug: off timestamp: on logger_subsys { subsys: AMF debug: off tags: enter|leave|trace1|trace2|trace3|trace4|trace6 } } my drbd.conf: global { usage-count no; } common { protocol C; handlers { pri-on-incon-degr /usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b /proc/sysrq-trigger ; reboot -f; pri-lost-after-sb /usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b /proc/sysrq-trigger ; reboot -f; local-io-error /usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o /proc/sysrq-trigger ; halt -f; } startup { degr-wfc-timeout 120; } disk { on-io-error detach; } syncer { rate 100M; al-extents 257; } } resource r0 { protocol C; flexible-meta-disk internal; on db2 { address 192.168.0.10:7801 ; device /dev/drbd0 minor 0; disk /dev/sdb1; } on db { device /dev/drbd0 minor 0; disk /dev/db/sync; address 192.168.0.20:7801 ; } handlers { split-brain /usr/lib/drbd/notify-split-brain.sh root; } net { after-sb-0pri discard-younger-primary; #discard-zero-changes; after-sb-1pri discard-secondary; after-sb-2pri
Re: [Pacemaker] Fencing of bare-metal remote nodes
- Original Message -

25.11.2014 23:41, David Vossel wrote:

- Original Message -

Hi! Is subj implemented? Trying echo c > /proc/sysrq-trigger on remote nodes, and no fencing occurs.

Yes, fencing remote-nodes works. Are you certain your fencing devices can handle fencing the remote-node? Fencing a remote-node requires a cluster node to invoke the agent that actually performs the fencing action on the remote-node.

David, a couple of questions. I see that in your fencing tests you just stop the systemd unit. Shouldn't pacemaker_remoted somehow notify crmd that it is being shut down? And shouldn't crmd stop all resources on that remote node before granting that shutdown?

yes, this needs to happen at some point. Right now the shutdown method for a remote-node is to disable the connection resource and wait for all the resources to stop before killing pacemaker_remoted on the remote node. That isn't exactly ideal.

Also, from what I see now it would be natural to hide the current implementation of remote node configuration under <node/> syntax. Remote nodes now have almost all features of normal nodes, including node attributes. What do you think about it?

ha, well. yes. at this point that might make sense. I had originally never planned on remote-nodes entering the actual nodes section, but eventually that changed. I'd like for usage of remote nodes to mature a bit before I commit to changing something like this though. I'm still a bit uncertain how people are going to use baremetal remote nodes. The use cases people come up with keep surprising me. Keeping the remote node definition as a resource gives us a bit more flexibility for configuration.

-- Vossel

Best, Vladislav
Re: [Pacemaker] Fencing of bare-metal remote nodes
- Original Message -

26.11.2014 18:36, David Vossel wrote:

- Original Message -

25.11.2014 23:41, David Vossel wrote:

- Original Message -

Hi! Is subj implemented? Trying echo c > /proc/sysrq-trigger on remote nodes, and no fencing occurs.

Yes, fencing remote-nodes works. Are you certain your fencing devices can handle fencing the remote-node? Fencing a remote-node requires a cluster node to invoke the agent that actually performs the fencing action on the remote-node.

Yes, if I invoke the fencing action manually ('crm node fence rnode' in crmsh syntax), the node is fenced. So the issue seems to be related to the detection of the need for fencing. Comments in related git commits are a little bit terse in this area. So could you please explain what exactly needs to happen on a remote node to initiate fencing? I tried so far:

* kill pacemaker_remoted when no resources are running. systemd restarted it and crmd reconnected after some time.

This should definitely cause the remote-node to be fenced. I tested this earlier today after reading you were having problems, and my setup fenced the remote-node correctly.

* crash the kernel when no resources are running

If a remote-node connection is lost and pacemaker was able to verify the node was clean before the connection was lost, pacemaker will attempt to reconnect to the remote-node without issuing a fencing request. I could see why both fencing and not fencing in this situation could be desired. Maybe I should make an option.

* crash the kernel during a massive start of resources

This should definitely cause the remote node to be fenced.

this last one should definitely cause fencing. What version of pacemaker are you using? I've made changes in this area recently. Can you provide a crm_report?

It's c191bf3. crm_report is ready, but I am still waiting for approval from a customer to send it.

Great. I really need to see what you all are doing. Outside of my own setup I have not seen many setups where pacemaker remote is deployed on baremetal nodes. It is possible something in your configuration exposes some edge case I haven't encountered yet. There's a US holiday Thursday and Friday, so I won't be able to look at this until next week.

-- Vossel

-- David

No fencing happened. In the last case the start actions 'hung' and were failed by timeout (it is rather long); the node was not even listed as failed. My customer asked me to stop crashing nodes because one of them does not boot anymore (I like that modern UEFI hardware very much), so it is hard for me to play with that any more.
Best, Vladislav
Re: [Pacemaker] [Cluster-devel] [Linux-HA] [ha-wg] [RFC] Organizing HA Summit 2015
- Original Message - On 25 Nov 2014, at 8:54 pm, Lars Marowsky-Bree l...@suse.com wrote: On 2014-11-24T16:16:05, Fabio M. Di Nitto fdini...@redhat.com wrote: Yeah, well, devconf.cz is not such an interesting event for those who do not wear the fedora ;-) That would be the perfect opportunity for you to convert users to Suse ;) I´d prefer, at least for this round, to keep dates/location and explore the option to allow people to join remotely. Afterall there are tons of tools between google hangouts and others that would allow that. That is, in my experience, the absolute worst. It creates second class participants and is a PITA for everyone. I agree, it is still a way for people to join in tho. I personally disagree. In my experience, one either does a face-to-face meeting, or a virtual one that puts everyone on the same footing. Mixing both works really badly unless the team already knows each other. I know that an in-person meeting is useful, but we have a large team in Beijing, the US, Tasmania (OK, one crazy guy), various countries in Europe etc. Yes same here. No difference.. we have one crazy guy in Australia.. Yeah, but you're already bringing him for your personal conference. That's a bit different. ;-) OK, let's switch tracks a bit. What *topics* do we actually have? Can we fill two days? Where would we want to collect them? Personally I'm interested in talking about scaling - with pacemaker-remoted and/or a new messaging/membership layer. If we're going to talk about scaling, we should throw in our new docker support in the same discussion. Docker lends itself well to the pet vs cattle analogy. I see management of docker with pacemaker making quite a bit of sense now that we have the ability to scale into the cattle territory. Other design-y topics: - SBD - degraded mode - improved notifications - containerisation of services (cgroups, docker, virt) - resource-agents (upstream releases, handling of pull requests, testing) Yep, We definitely need to talk about the resource-agents. User-facing topics could include recent features (ie. pacemaker-remoted, crm_resource --restart) and common deployment scenarios (eg. NFS) that people get wrong. Adding to the list, it would be a good idea to talk about Deployment integration testing, what's going on with the phd project and why it's important regardless if you're interested in what the project functionally does. -- Vossel ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] exportfs resource agent modifications
- Original Message -

Hi there, we are using exportfs for building datastores for VMware. After migrating one NFS resource to another node (e.g. after successful fencing), VMware doesn't see that datastore until I manually fire _exportfs -f_ on the new cluster node. I tried to modify the resource agent itself like:

247     restore_rmtab
248
249     ocf_log info "File system exported"
250
251     sleep 5                                         # added
252
253     ocf_run exportfs -f || exit $OCF_ERR_GENERIC    # added
254
255     ocf_log info "kernel table flushed"             # added
256
257     return $OCF_SUCCESS

but this didn't do the trick. Does anyone have an idea how to resolve that issue?

HA NFS is tricky and requires a very specific resource startup/shutdown order to work correctly. Here's some information about the use cases I test. At this point, the active/passive use case is well understood. If you are able to, I would recommend modeling deployments using the A/P use case guidelines.

HA NFS Active/Passive:
https://github.com/davidvossel/phd/blob/master/doc/presentations/nfs-ap-overview.pdf?raw=true
https://github.com/davidvossel/phd/blob/master/scenarios/nfs-active-passive.scenario

HA NFS Active/Active:
https://github.com/davidvossel/phd/blob/master/doc/presentations/nfs-aa-overview.pdf?raw=true
https://github.com/davidvossel/phd/blob/master/scenarios/nfs-active-active.scenario

-- Vossel

Cheers, Hauke
Re: [Pacemaker] HA setup of MySQL service using Pacemaker/DRBD
- Original Message -

Hi, I have a working 2-node HA setup running on CentOS 6.5 with a very simple Apache webserver and a replicated index.html using DRBD 8.4. The setup is configured based on the Clusters from Scratch Edition 5 with Fedora 13. I now wish to replace Apache with a MySQL database, or just simply add it. How can I do so? I'm guessing the following:

1. Add the MySQL service to the cluster with a crm configure primitive command. I'm not sure what the params should be though, e.g. the configfile.
2. Set the same colocation/order rules.
3. Create/initialize a separate DRBD partition for MySQL (or can I reuse the same partition as Apache, assuming I'll never exceed its capacity?)
4. Copy the database/tables into the mounted DRBD partition.
5. Configure the cluster for DRBD as per Chapter 7.4 of the guide.

Is this correct? Step-by-step instructions would be appreciated; I have some experience with RHEL/CentOS but not with HA nor MySQL.

You've got the right idea. I don't have a step-by-step guide for crmsh, but here's a basic MySQL deployment on shared storage that I test with pcs. I just mount the shared storage partition to the /var/lib/mysql directory, then start mysql. From there, the database mysql uses lives on shared storage and can follow the mysql instance wherever it goes in the cluster.

https://github.com/davidvossel/phd/blob/master/scenarios/mariadb-basic.scenario

-- Vossel

Thanks! -- - Goi Sihan gois...@gmail.com
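Translating that idea into the crmsh style used elsewhere in this thread, a rough sketch might look like the following; the resource names, the DRBD master/slave resource name ms_drbd_mysql, the device, and the timeouts are all placeholders to be adapted to the existing DRBD/Apache configuration:

    # hypothetical crmsh sketch: mysql on a DRBD-backed filesystem
    primitive mysql_fs ocf:heartbeat:Filesystem \
        params device=/dev/drbd1 directory=/var/lib/mysql fstype=ext4 \
        op monitor interval=30s
    primitive mysql_db ocf:heartbeat:mysql \
        params config=/etc/my.cnf datadir=/var/lib/mysql \
        op start timeout=120s interval=0 \
        op stop timeout=120s interval=0 \
        op monitor interval=20s timeout=30s
    group mysql_group mysql_fs mysql_db
    colocation mysql_with_drbd inf: mysql_group ms_drbd_mysql:Master
    order mysql_after_drbd inf: ms_drbd_mysql:promote mysql_group:start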
Re: [Pacemaker] resource-stickiness not working?
- Original Message -

Here is a simple Active/Passive configuration with a single Dummy resource (see end of message). The resource-stickiness default is set to 100. I was assuming that this would be enough to keep the Dummy resource on the active node as long as the active node stays healthy. However, stickiness is not working as I expected in the following scenario:

1) The node testnode1, which is running the Dummy resource, reboots or crashes
2) The Dummy resource fails over to node testnode2
3) testnode1 comes back up after the reboot or crash
4) The Dummy resource fails back to testnode1

I don't want the resource to fail back to the original node in step 4. That is why resource-stickiness is set to 100. The only way I can get the resource not to fail back is to set resource-stickiness to INFINITY. Is this the correct behavior of resource-stickiness? What am I missing? This is not what I understand from the documentation on clusterlabs.org. BTW, after reading various postings on failback issues, I played with setting on-fail to standby, but that doesn't seem to help either. Any help is appreciated!

I agree, this is curious. Can you attach a crm_report? Then we can walk through the transitions to figure out why this is happening.

-- Vossel

Scott

node testnode1
node testnode2
primitive dummy ocf:heartbeat:Dummy \
    op start timeout=180s interval=0 \
    op stop timeout=180s interval=0 \
    op monitor interval=60s timeout=60s migration-threshold=5
xml <rsc_location id="cli-prefer-dummy" rsc="dummy" role="Started" node="testnode2" score="INFINITY"/>
property $id=cib-bootstrap-options \
    dc-version=1.1.10-14.el6-368c726 \
    cluster-infrastructure="classic openais (with plugin)" \
    expected-quorum-votes=2 \
    stonith-enabled=false \
    stonith-action=reboot \
    no-quorum-policy=ignore \
    last-lrm-refresh=1413378119
rsc_defaults $id=rsc-options \
    resource-stickiness=100 \
    migration-threshold=5
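For reference, a crm_report covering the window around the unexpected failback can usually be generated along these lines (the timestamps are placeholders):

    # collect cluster logs and PE inputs from just before the failover until now
    crm_report -f "2014-10-15 12:00" -t "2014-10-15 14:00" /tmp/failback-report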
Re: [Pacemaker] Operation attribute change leads to resource restart
- Original Message - Hi! Just noticed that deletion of a trace_ra op attribute forces resource to be restarted (that RA does not support reload). Logs show: Nov 13 09:06:05 [6633] node01cib: info: cib_process_request: Forwarding cib_apply_diff operation for section 'all' to master (origin=local/cibadmin/2) Nov 13 09:06:05 [6633] node01cib: info: cib_perform_op: Diff: --- 0.641.96 2 Nov 13 09:06:05 [6633] node01cib: info: cib_perform_op: Diff: +++ 0.643.0 98ecbda94c7e87250cf2262bf89f43e8 Nov 13 09:06:05 [6633] node01cib: info: cib_perform_op: -- /cib/configuration/resources/clone[@id='cl-test-instance']/primitive[@id='test-instance']/operations/op[@id='test-instance-start-0']/instance_attributes[@id='test-instance-start-0-instance_attributes'] Nov 13 09:06:05 [6633] node01cib: info: cib_perform_op: + /cib: @epoch=643, @num_updates=0 Nov 13 09:06:05 [6633] node01cib: info: cib_process_request: Completed cib_apply_diff operation for section 'all': OK (rc=0, origin=node01/cibadmin/2, version=0.643.0) Nov 13 09:06:05 [6638] node01 crmd: info: abort_transition_graph: Transition aborted by deletion of instance_attributes[@id='test-instance-start-0-instance_attributes']: Non-status change (cib=0.643.0, source=te_update_diff:383, path=/cib/configuration/resources/clone[@id='cl-test-instance']/primitive[@id='test-instance']/operations/op[@id='test-instance-start-0']/instance_attributes[@id='test-instance-start-0-instance_attributes'], 1) Nov 13 09:06:05 [6638] node01 crmd: notice: do_state_transition: State transition S_IDLE - S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph] Nov 13 09:06:05 [6634] node01 stonith-ng: info: xml_apply_patchset: v2 digest mis-match: expected 98ecbda94c7e87250cf2262bf89f43e8, calculated 0b344571f3e1bb852e3d10ca23183688 Nov 13 09:06:05 [6634] node01 stonith-ng: notice: update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of an update diff failed (-206) ... Nov 13 09:06:05 [6637] node01pengine: info: check_action_definition: params:reload parameters boot_directory=/var/lib/libvirt/boot config_uri=http://192.168.168.10:8080/cgi-bin/manage_config.cgi?action=%aamp;resource=%namp;instance=%i; start_vm=1 vlan_id_start=2 per_vlan_ip_prefix_len=24 base_img=http://192.168.168.10:8080/pre45-mguard-virt.x86_64.default.qcow2; pool_name=default outer_phy=eth0 ip_range_prefix=10.101.0.0/16/ Nov 13 09:06:05 [6637] node01pengine: info: check_action_definition: Parameters to test-instance:0_start_0 on rnode001 changed: was 6f9eb6bd1f87a2b9b542c31cf1b9c57e vs. now 02256597297dbb42aadc55d8d94e8c7f (reload:3.0.9) 0:0;41:3:0:95e66b6a-a190-4e61-83a7-47165fb0105d ... Nov 13 09:06:05 [6637] node01pengine: notice: LogActions: Restart test-instance:0 (Started rnode001) That is not what I'd expect to see. Any time an instance attribute is changed for a resource, the resource is restarted/reloaded. This is expected. -- Vossel Is it intended or just a minor bug(s)? Best, Vladislav ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] resource-discovery question
- Original Message - 12.11.2014 22:57, David Vossel wrote: - Original Message - 12.11.2014 22:04, Vladislav Bogdanov wrote: Hi David, all, I'm trying to get resource-discovery=never working with cd7c9ab, but still get Not installed probe failures from nodes which does not have corresponding resource agents installed. The only difference in my location constraints comparing to what is committed in #589 is that they are rule-based (to match #kind). Is that supposed to work with the current master or still TBD? Yep, after I modified constraint to a rule-less syntax, it works: ahh, good catch. I'll take a look! rsc_location id=vlan003-on-cluster-nodes rsc=vlan003 score=-INFINITY node=rnode001 resource-discovery=never/ But I'd prefer to that killer feature to work with rules too :) Although resource-discovery=exclusive with score 0 for multiple nodes should probably also work for me, correct? yep it should. I cannot test that on a cluster with one cluster node and one remote node. this feature should work the same with remote nodes and cluster nodes. I'll get a patch out for the rule issue. I'm also pushing out some documentation for the resource-discovery option. It seems like you've got a good handle on it already though :) Oh, I see new pull-request, thank you very much! One side question: Is default value for clone-max influenced by resource-discovery value(s)? kind of. with 'exclusive' if the number of nodes in the exclusive set is smaller than clone-max, clone-max is effectively reduced to the node count in the exclusive set. 'never' and 'always' do not directly influence resource placement, only 'exclusive' My location constraints look like: rsc_location id=vlan003-on-cluster-nodes rsc=vlan003 resource-discovery=never rule score=-INFINITY id=vlan003-on-cluster-nodes-rule expression attribute=#kind operation=ne value=cluster id=vlan003-on-cluster-nodes-rule-expression/ /rule /rsc_location Do I miss something? Best, Vladislav ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
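A minimal sketch of the rule-less 'exclusive' variant discussed above, assuming two nodes named node01 and node02; the constraint ids are illustrative:

  <rsc_location id="vlan003-discovery-node01" rsc="vlan003" score="0" node="node01" resource-discovery="exclusive"/>
  <rsc_location id="vlan003-discovery-node02" rsc="vlan003" score="0" node="node02" resource-discovery="exclusive"/>

With 'exclusive', probing and placement are limited to the nodes in the exclusive set, which, as noted above, also caps the effective clone-max at the size of that set.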
Re: [Pacemaker] resource-discovery question
- Original Message - 12.11.2014 22:04, Vladislav Bogdanov wrote: Hi David, all, I'm trying to get resource-discovery=never working with cd7c9ab, but still get Not installed probe failures from nodes which does not have corresponding resource agents installed. The only difference in my location constraints comparing to what is committed in #589 is that they are rule-based (to match #kind). Is that supposed to work with the current master or still TBD? Yep, after I modified constraint to a rule-less syntax, it works: ahh, good catch. I'll take a look! rsc_location id=vlan003-on-cluster-nodes rsc=vlan003 score=-INFINITY node=rnode001 resource-discovery=never/ But I'd prefer to that killer feature to work with rules too :) Although resource-discovery=exclusive with score 0 for multiple nodes should probably also work for me, correct? yep it should. I cannot test that on a cluster with one cluster node and one remote node. this feature should work the same with remote nodes and cluster nodes. I'll get a patch out for the rule issue. I'm also pushing out some documentation for the resource-discovery option. It seems like you've got a good handle on it already though :) My location constraints look like: rsc_location id=vlan003-on-cluster-nodes rsc=vlan003 resource-discovery=never rule score=-INFINITY id=vlan003-on-cluster-nodes-rule expression attribute=#kind operation=ne value=cluster id=vlan003-on-cluster-nodes-rule-expression/ /rule /rsc_location Do I miss something? Best, Vladislav ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] #kind eq container matches bare-metal nodes
- Original Message - 23.10.2014 22:39, David Vossel wrote: - Original Message - 21.10.2014 06:25, Vladislav Bogdanov wrote: 21.10.2014 05:15, Andrew Beekhof wrote: On 20 Oct 2014, at 8:52 pm, Vladislav Bogdanov bub...@hoster-ok.com wrote: Hi Andrew, David, all, It seems like #kind was introduced before bare-metal remote node support, and now it is matched against cluster and container. Bare-metal remote nodes match container (they are remote), but strictly speaking they are not containers. Could/should that attribute be extended to the bare-metal use case? Unclear, the intent was 'nodes that aren't really cluster nodes'. Whats the usecase for wanting to tell them apart? (I can think of some, just want to hear yours) I want VM resources to be placed only on bare-metal remote nodes. -inf: #kind ne container looks a little bit strange. #kind ne remote would be more descriptive (having now them listed in CIB with 'remote' type). One more case (which is what I'd like to use in the mid-future) is a mixed remote-node environment, where VMs run on bare-metal remote nodes using storage from cluster nodes (f.e. sheepdog), and some of that VMs are whitebox containers themselves (they run services controlled by pacemaker via pacemaker_remoted). Having constraint '-inf: #kind ne container' is not enough to not try to run VMs inside of VMs - both bare-metal remote nodes and whitebox containers match 'container'. remember, you can't run remote-nodes nested within remote-nodes... so container nodes on baremetal remote-nodes won't work. Good to know, thanks. That imho should go into the documentation in bold red :) yep, I'm seeing that now. Is that a conceptual limitation or it is just not yet supported? I'm not sure yet. Nested pacemaker_remote is complex. Perhaps I'll find a clever way of doing it at some point in the future. Right now all my solutions are too complex to be useful, which is why the limitation exists. -- David You don't have to be careful about not messing this up or anything. You can mix container nodes and baremetal remote-nodes and everything should work fine. The policy engine will never allow you to place a container node on a baremetal remote-node though. 
-- David ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
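A sketch of the constraint form discussed above (-inf: #kind ne container) in XML, with illustrative ids. As the thread explains, both bare-metal remote nodes and whitebox container nodes currently report #kind as 'container', so this keeps a VM off cluster nodes but cannot tell the two remote flavours apart:

  <rsc_location id="vm-only-on-remote" rsc="my-vm">
    <rule id="vm-only-on-remote-rule" score="-INFINITY">
      <expression id="vm-only-on-remote-rule-expr" attribute="#kind" operation="ne" value="container"/>
    </rule>
  </rsc_location>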
Re: [Pacemaker] Pacemaker-remote with KVM - start timeout not working
- Original Message - Hi! I guess it would be better to start a separate thread on this. I have a VM with pacemaker-remote installed. Stack: cman Current DC: wings1 - partition with quorum Version: 1.1.10-14.el6-368c726 3 Nodes configured 2 Resources configured Online: [ oracle-test:vm-oracle-test wings1 wings2 ] The remote-node in this case is named 'oracle-test'. The remote-node's container resource is 'vm-oracle-test'. Internally pacemaker makes a connection resource named after the remote-node. That resource represents the pacemaker_remote connection. Kind of confusing I know. Here's the point. The connection resource 'oracle-test' is what is timing out here, not the vm itself. By default the connection resource has a 60 second timeout. If you want to increase that timeout use the remote-connect-timeout resource metadata option. You don't have to fully understand how all this works, just know that the remote-connection-timeout option needs to be greater than the time it takes for the virtual machine to fully initialize. http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#s-resource-options Hope that helps! -- Vossel vm-oracle-test (ocf::heartbeat:VirtualDomain): Started wings2 2 resources configured... However, # pcs resource show vm-oracle-test (ocf::heartbeat:VirtualDomain): Started As I understand, pacemaker considered pacemaker-remote on the VM as some sort of 'virtual resource' (called 'oracle-test' in my case), since I have only one 'primitive' section (VirtualDomain) in my CIB. Well, the problem is here: Sep 15 12:28:13 wings1 crmd[13553]: error: process_lrm_event: LRM operation oracle-test_start_0 (8397) Timed Out (timeout=6ms) Sep 15 12:28:13 wings1 crmd[13553]: warning: status_from_rc: Action 7 (oracle-test_start_0) on wings1 failed (target: 0 vs. rc: 1): Error Sep 15 12:28:13 wings1 crmd[13553]: warning: update_failcount: Updating failcount for oracle-test on wings1 after failed start: rc=1 (update=INFINITY, time=1 410769693) Timeout is 60 seconds! Even though I have: primitive class=ocf id=vm-oracle-test provider=heartbeat type=VirtualDomain instance_attributes id=vm-oracle-test-instance_attributes nvpair id=vm-oracle-test-instance_attributes-hypervisor name=hypervisor value=qemu:///system/ nvpair id=vm-oracle-test-instance_attributes-config name=config value=/etc/libvirt/qemu/oracle-test.xml/ /instance_attributes operations op id=vm-oracle-test-monitor-interval-60s interval=60s name=monitor/ op id=vm-oracle-test-start-timeout-300s-interval-0s-on-fail-restart interval=0s name=start on-fail=restart timeout=300s/ op id=vm-oracle-test-stop-timeout-60s-interval-0s-on-fail-block interval=0s name=stop on-fail=block timeout=60s/ /operations Moreover, VirtualDomain RA has this: actions action name=start timeout=90 / action name=stop timeout=90 / action name=status depth=0 timeout=30 interval=10 / action name=monitor depth=0 timeout=30 interval=10 / action name=migrate_from timeout=60 / action name=migrate_to timeout=120 / My VM is unable to start in 60 seconds. What could be done here? -- Best regards, Alexandr A. 
Alexandrov ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
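A sketch of the fix described above, expressed as meta attributes on the container resource; the ids are hypothetical and the 300s value is only an example that should simply exceed the VM's boot time:

  <primitive class="ocf" id="vm-oracle-test" provider="heartbeat" type="VirtualDomain">
    <meta_attributes id="vm-oracle-test-meta_attributes">
      <nvpair id="vm-oracle-test-meta-remote-node" name="remote-node" value="oracle-test"/>
      <nvpair id="vm-oracle-test-meta-remote-connect-timeout" name="remote-connect-timeout" value="300s"/>
    </meta_attributes>
    <!-- existing instance_attributes and operations stay as they are -->
  </primitive>

The start timeout on the VirtualDomain operations still governs how long the VM itself may take to boot; remote-connect-timeout only governs the connection resource ('oracle-test') that was timing out in the logs.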
Re: [Pacemaker] Notification when a node is down
- Original Message - Hi, Is there any way for a Pacemaker/Corosync/PCS setup to send a notification when it detects that a node in a cluster is down? I read that Pacemaker and Corosync logs events to syslog, but where is the syslog file in CentOS? Do they log events such as a failover occurrence? This might be a useful reference. http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#idm207039249856 -- Vossel Thanks. -- - Goi Sihan gois...@gmail.com ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
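On CentOS the syslog output from pacemaker and corosync normally lands in /var/log/messages. For active notification rather than log watching, one common approach is the ocf:pacemaker:ClusterMon agent driving an external script; a hedged sketch, where /usr/local/bin/cluster_notify.sh is a hypothetical script you would provide:

  pcs resource create cluster-notify ocf:pacemaker:ClusterMon \
      extra_options="-E /usr/local/bin/cluster_notify.sh" --clone

crm_mon, which ClusterMon wraps, then runs the script on cluster events and passes details such as the node and resource involved in CRM_notify_* environment variables.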
Re: [Pacemaker] pacemaker-remote location constraint
- Original Message - Is it possible to put a location constraint on a resource such that it will only run on a pacemaker-remote node? Or vice-versa, so that a resource will not run on a pacemaker-remote node? At a glance, this doesn't seem possible as the pacemaker-remote node does not exist as a node entry in the CIB, so there's nothing to match on. Is it possible to match on the absence of that node entry? The only other way I can think of doing this is to set utilization attributes, such that the remote nodes provide a remote utilization attribute, and configure the resource such that it needs 1 unit of remote. There is definitely a way to do this already. I can't remember how though. Andrew, I know we discussed this a few months ago and you immediately rattled off how we allow this. Do you remember? -- Vossel -Patrick ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
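One approach that follows from the #kind discussion elsewhere in this digest (not necessarily the mechanism Andrew had in mind) is a rule on the #kind node attribute. A sketch for the 'never run on a pacemaker-remote node' direction, with illustrative ids:

  <rsc_location id="only-on-cluster-nodes" rsc="my-rsc">
    <rule id="only-on-cluster-nodes-rule" score="-INFINITY">
      <expression id="only-on-cluster-nodes-rule-expr" attribute="#kind" operation="ne" value="cluster"/>
    </rule>
  </rsc_location>

Changing the operation to 'eq' bans the resource from cluster nodes instead, leaving only remote nodes eligible.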
Re: [Pacemaker] pacemaker-remote container as a clone resource
- Original Message - I'm interested in creating a resource that will control host containers running pacemaker-remote. The catch is that I want this resource to be a clone, so that if I want more containers, I simply increase the `clone-max` property. The problem is that the `remote-node` meta parameter is set on the clone resource, and not the children. So I can't see a way to tell pacemaker the address of all the clone children containers that get created. The only way I can see something like this being possible is if the resource agent set an attribute on the clone child when the child was started. Is there any way to accomplish this? Or will I have to create separate resources for every single container? If this isn't possible, would this be considered as a future feature? I've been keeping up with this thread and just wanted to give my thoughts. First, lets forget the clone part initially. Clones and pacemaker_remote don't mix. I'm not convinced it is advantageous for us to take that conversation much further right now. Perhaps after start testing this next part we can come back to the cloned remote-node discussion. The interesting part here is that you want to define a remote-node that doesn't have an address. I see this as being highly beneficial. For instance, you want to make a docker container a remote-node, but docker assigns a random IP to the container every time it starts... Right now there'd be no (feasible) way to make the docker container a remote-node because there's no static IP associated with the container. I think you are dead on about how this should be done. We need a way for the container's resource-agent to update pacemaker via an attribute the address for the container after it has started. From there pacemaker uses that address when attempting to connect to the container's pacemaker_remote instance. I've made a issue in the Red Hat bug tracker related to this so I'll remember to do it. https://bugzilla.redhat.com/show_bug.cgi?id=1139843 -- Vossel Thanks -Patrick ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
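To make the idea concrete, here is a sketch of the agent-side half being proposed, i.e. publishing the container's address as a node attribute once the container is up. Everything in it is hypothetical: the parameter name, the attribute name and the docker invocation are illustrative, and the pacemaker side that would consume such an attribute was still an open request at the time:

  # inside a hypothetical container agent's start action, after the container starts:
  addr=$(docker inspect --format '{{ .NetworkSettings.IPAddress }}' "$OCF_RESKEY_container_name")
  attrd_updater -n "remote-addr-${OCF_RESKEY_container_name}" -U "$addr"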
[Pacemaker] Libqb v0.17.1 release candidate
Hey, I'm gearing up for a new libqb release. https://github.com/ClusterLabs/libqb/releases/tag/v0.17.1-rc1 If you weren't aware, libqb is the library used for ipc, logging, and event loops in Pacemaker/Corosync. This release is to address the bug fixes that have been submitted since the v0.17.0 release. If you have any patches you want to get into the next release let me know as soon as possible. Assuming we don't discover any issues during release candidate testing, rc1 will become the 0.17.1 release mid next week. Thanks, -- Vossel ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
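For anyone wanting to try the candidate, a rough sketch of building it from the tag; this assumes the usual autotools flow the project uses, so treat the exact steps as approximate:

  git clone https://github.com/ClusterLabs/libqb.git
  cd libqb
  git checkout v0.17.1-rc1
  ./autogen.sh && ./configure
  make && make check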
Re: [Pacemaker] On recovery of failed node, pengine fails to correctly monitor 'dirty' resources
, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-15.bz2): Complete Aug 12 11:28:14 ti14-demo1 crmd[3147]: notice: do_state_transition: State transition S_TRANSITION_ENGINE - S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ] On Monday, August 11, 2014 4:06 PM, David Vossel dvos...@redhat.com wrote: - Original Message - Greetings, We are using pacemaker and cman in a two-node cluster with no-quorum-policy: ignore and stonith-enabled: false on a Centos 6 system (pacemaker related RPM versions are listed below). We are seeing some bizarre (to us) behavior when a node is fully lost (e.g. reboot -nf ). Here's the scenario we have: 1) Fail a resource named some-resource started with the ocf:heartbeat:anything script (or others) on node01 (in our case, it's a master/slave resource we're pulling observations from, but it can happen on normal ones). 2) Wait for Resource to recover. 3) Fail node02 (reboot -nf, or power loss) 4) When node02 recovers, we see in /var/log/messages: - Quorum is recovered - Sending flush op to all hosts for master-some-resource, last-failure-some-resource, probe_complete(true), fail-count-some-resource(1) - pengine Processing failed op monitor for some-resource on node01: unknown error (1) * After adding a simple `date` called with $@ /tmp/log.rsc, we do not see the resource agent being called at this time, on either node. * Sometimes, we see other operations happen that are also not being sent to the RA, including stop/start * The resource is actually happilly running on node01 throughtout this whole process, so there's no reason we should be seeing this failure here. * This issue is only seen on resources that had not yet been cleaned up. Resources that were 'clean' when both nodes were last online do not have this issue. We noticed this originally because we are using the ClusterMon RA to report on different types of errors, and this is giving us false positives. Any thoughts on configuration issues we could be having, or if this sounds like a bug in pacemaker somewhere? This is likely a bug in whatever resource-agent you are using. There's no way for us to know for sure without logs. -- Vossel Thanks! Versions: ccs-0.16.2-69.el6_5.1.x86_64 clusterlib-3.0.12.1-59.el6_5.2.x86_64 cman-3.0.12.1-59.el6_5.2.x86_64 corosync-1.4.1-17.el6_5.1.x86_64 corosynclib-1.4.1-17.el6_5.1.x86_64 fence-virt-0.2.3-15.el6.x86_64 libqb-0.16.0-2.el6.x86_64 modcluster-0.16.2-28.el6.x86_64 openais-1.1.1-7.el6.x86_64 openaislib-1.1.1-7.el6.x86_64 pacemaker-1.1.10-14.el6_5.3.x86_64 pacemaker-cli-1.1.10-14.el6_5.3.x86_64 pacemaker-cluster-libs-1.1.10-14.el6_5.3.x86_64 pacemaker-libs-1.1.10-14.el6_5.3.x86_64 pcs-0.9.90-2.el6.centos.3.noarch resource-agents-3.9.2-40.el6_5.7.x86_64 ricci-0.16.2-69.el6_5.1.x86_64 ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Configuration recommandations for (very?) large cluster
- Original Message - On 12/08/14 07:52, Andrew Beekhof wrote: On 11 Aug 2014, at 10:10 pm, Cédric Dufour - Idiap Research Institute cedric.duf...@idiap.ch wrote: Hello, Thanks to Pacemaker 1.1.12, I have been able to setup a (very?) large cluster: Thats certainly up there as one of the biggest :) Well, actually, I sized it down from 444 to 277 resources by merging 'VirtualDomain' and 'MailTo' RA/primitives into a custom/single 'LibvirtQemu' one. CIB is now ~3MiB uncompressed / ~100kiB compressed. (also avoids the informational-only 'MailTo' RA to come burden the cluster) 'PCMK_ipc_buffer' at 2MiB might be overkill now... but I'd rather stay on the safe side. Q: Are there adverse effects in keeping 'PCMK_ipc_buffer' high? More system memory will be required for ipc connections. Unless you're running low on ram, you should be fine with the buffer you set. 277 resources are: - 22 (cloned) network-health (ping) resources - 88 (cloned) stonith resources (I have 4 stonith devices) - 167 LibvirtQemu resources (83 general-purpose servers and 84 SGE-driven computation nodes) (and more LibvirtQemu resources are expected to come) Have you checked pacemaker's CPU usage during startup/failover? I'd be interested in your results. I finally set 'batch-limit' set to 22 - the quantity of nodes - as it makes sense when enabling a new primitive, as all monitor operations get dispatched immediately to all nodes at once. When bringing a standby node to life: - On the waking node (E5-2690v2): 167+5 resources monitoring operations get dispatched; the CPU load of the 'cib' process remains below 100% as the operations are executed, batched by 22 (though one can not see that batching, the monitoring operations succeeding very quickly), and complete in ~2 seconds. With Pacemaker 1.1.7, the 'cib' load would have peaked to 100% even before the first monitoring operation started (because of the CIB refresh, I guess) and would remain so for several tens of seconds (often resulting in timeouts and monitoring operations failure) - On the DC node (E5-2690v2): the CPU would also remain below 100%, alternating between the 'cib', 'pengine' and 'crmd' process. The DC is back to IDLE within ~4 seconds. I tried raising the 'batch-limit' to 50 and witnessed CPU load peaking at 100% while carrying out the same procedure, but all went well nonetheless. While I still had the ~450 resources, I also accidentally brought all 22 nodes back to life together (well, actually started the DC alone and then started the remaining 21 nodes together). As could be expected, the DC got quite busy (dispatching/executing the ~450*22 monitoring operations on all nodes). It took 40 minutes for the cluster to stabilize. But it did stabilize, with no timeout and not monitor operations failure! A few high CIB load detected / throttle down mode messages popped up but all went well. Q: Is there a way to favorize more powerful nodes for the DC (iow. push the DC election process in a preferred direction) ? Last updated: Mon Aug 11 13:40:14 2014 Last change: Mon Aug 11 13:37:55 2014 Stack: classic openais (with plugin) I would at least try running it with corosync 2.x (no plugin) That will use CPG for messaging which should perform even better. I'm running into a deadline now and will have to stick to 1.4.x for the moment. But as soon as I can free an old test Intel modular chassis I have around, I'll try backporting Coro 2.x from Debian/Experimental to Debian/Wheezy and see what gives. 
Current DC: bc1hx5a05 - partition with quorum Version: 1.1.12-561c4cf 22 Nodes configured, 22 expected votes 444 Resources configured PS: 'corosync' (1.4.7) traffic goes through a 10GbE network, with strict QoS priority over all other traffic. Are there recommended configuration tweaks I should not miss in such situation? So far, I have: - Raised the 'PCMK_ipc_buffer' size to 2MiB - Lowered the 'batch-limit' to 10 (though I believe my setup could sustain the default 30) Yep, definitely worth trying the higher value. We _should_ automatically start throttling ourselves if things get too intense. Yep. As mentioned above, I did see high CIB load detected / throttle down mode messages popup. Is this what you think about? Other than that, I would be making sure all the corosync.conf timeouts and other settings are appropriate. Never paid much attention to it so far. But it seems to me the Debian defaults are quite conservative, especially more so given my 10GbE (~0.2ms latency) interconnect and the care I took in prioritizing Corosync traffic (thanks to switches QoS/GMB and Linux 'tc'): token: 3000 token_retransmits_before_loss_const: 10 join: 60 consensus: 3600 vsftype: none max_messages: 20 secauth: off amf: disabled Am I right? PS: this work is being done within
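For reference, a sketch of where the two pacemaker tunables mentioned above live; the values are the ones from the thread, and the crm command assumes crmsh is in use, as it appears to be on this Debian setup:

  # /etc/sysconfig/pacemaker (Debian typically uses /etc/default/pacemaker)
  PCMK_ipc_buffer=2097152

  # cluster property, set here via crmsh
  crm configure property batch-limit=22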
Re: [Pacemaker] On recovery of failed node, pengine fails to correctly monitor 'dirty' resources
- Original Message - Greetings, We are using pacemaker and cman in a two-node cluster with no-quorum-policy: ignore and stonith-enabled: false on a Centos 6 system (pacemaker related RPM versions are listed below). We are seeing some bizarre (to us) behavior when a node is fully lost (e.g. reboot -nf ). Here's the scenario we have: 1) Fail a resource named some-resource started with the ocf:heartbeat:anything script (or others) on node01 (in our case, it's a master/slave resource we're pulling observations from, but it can happen on normal ones). 2) Wait for Resource to recover. 3) Fail node02 (reboot -nf, or power loss) 4) When node02 recovers, we see in /var/log/messages: - Quorum is recovered - Sending flush op to all hosts for master-some-resource, last-failure-some-resource, probe_complete(true), fail-count-some-resource(1) - pengine Processing failed op monitor for some-resource on node01: unknown error (1) * After adding a simple `date` called with $@ /tmp/log.rsc, we do not see the resource agent being called at this time, on either node. * Sometimes, we see other operations happen that are also not being sent to the RA, including stop/start * The resource is actually happilly running on node01 throughtout this whole process, so there's no reason we should be seeing this failure here. * This issue is only seen on resources that had not yet been cleaned up. Resources that were 'clean' when both nodes were last online do not have this issue. We noticed this originally because we are using the ClusterMon RA to report on different types of errors, and this is giving us false positives. Any thoughts on configuration issues we could be having, or if this sounds like a bug in pacemaker somewhere? This is likely a bug in whatever resource-agent you are using. There's no way for us to know for sure without logs. -- Vossel Thanks! Versions: ccs-0.16.2-69.el6_5.1.x86_64 clusterlib-3.0.12.1-59.el6_5.2.x86_64 cman-3.0.12.1-59.el6_5.2.x86_64 corosync-1.4.1-17.el6_5.1.x86_64 corosynclib-1.4.1-17.el6_5.1.x86_64 fence-virt-0.2.3-15.el6.x86_64 libqb-0.16.0-2.el6.x86_64 modcluster-0.16.2-28.el6.x86_64 openais-1.1.1-7.el6.x86_64 openaislib-1.1.1-7.el6.x86_64 pacemaker-1.1.10-14.el6_5.3.x86_64 pacemaker-cli-1.1.10-14.el6_5.3.x86_64 pacemaker-cluster-libs-1.1.10-14.el6_5.3.x86_64 pacemaker-libs-1.1.10-14.el6_5.3.x86_64 pcs-0.9.90-2.el6.centos.3.noarch resource-agents-3.9.2-40.el6_5.7.x86_64 ricci-0.16.2-69.el6_5.1.x86_64 ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
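A sketch of the tracing mentioned in the report above (a date line appended on every agent invocation), useful for confirming whether the cluster is actually calling the RA when these failures appear; the log path matches the one used in the thread:

  # near the top of the resource agent, before any action is dispatched:
  echo "$(date '+%F %T') $0 $*" >> /tmp/log.rsc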
Re: [Pacemaker] Why location can't work as expected
- Original Message - Hello everyone: my tools version pacemaker: 1.1.10 corosync: 1.4.5 crmsh-2.0 I have 2 nodes node1 and node2, resource agent Test must running on node1, and Test should not run on node2 if node1 is offline. So I do the following config: location TestOnNode1 Test INFINITY: node1 If node1 and node2 are both online, Test running on node1. But if node1 is offline, the resource agent Test will be switched to node2. I think that doesn't obey my config. My question: Is that pacemaker's feature? or I have missing some config? I find the following config can work as expected: location TestOnNode1 Test -INFINITY: node2 But I think that is not direct . Yes, this is expected behavior. Take a look at the symmetric-cluster option and learn about opt-in clusters here. http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_asymmetrical_opt_in_clusters -- Vossel thanks -- 宝存科技 david ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
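A sketch of the opt-in variant the reply points to, in the crmsh syntax the poster already uses: make the cluster asymmetric so resources run nowhere by default, then explicitly allow Test on node1.

  crm configure property symmetric-cluster=false
  crm configure location TestOnNode1 Test inf: node1

With an opt-in cluster, losing node1 leaves Test stopped rather than moving it to node2, which is the behaviour asked for above.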
Re: [Pacemaker] NFS concurrency
- Original Message - Hi, I’m working with SLES 11 SP3 and pacemaker 1.1.10-0.15.25 I’m looking to define constraints that can allow multiple NFSV4 filesystem/exports to be started concurrently (they belong to the same LVM). I also have multiple access points. My model looks like this: | FS1 -- Exportfs1 | |FS2 ---Exportfs2 | |FSn ---Exportfsn | rootfs---LVM1-- |---IP1 | |---IPn Essentially, rootfs starts first and then the LVM. After the LVM, I’m looking to start the filesystems and IPs next. Once each filesystem is started then the export that belongs to it. I’d like to do this since I’m using NFSV4 and the Gracetime and leasetime (I have set to 10 seconds) cause increasingly long stop times on failover. you don't have to observe the lease time during stop if you stop the nfs server first before doing the fs umount. I used groups to do something similar to this in my NFS active-active scenario and I don't wait for lease time during stop. After failover the scenario will wait the grace period on the node the export moved to. The grace period should be = than the lease time on the node the export moved from. presentation: https://github.com/davidvossel/phd/raw/master/doc/presentations/nfs-aa-overview.pdf Sample scenario: https://github.com/davidvossel/phd/blob/master/scenarios/nfs-active-active.scenario -- Vossel I managed to define the above using individual colocations and order constraints, but was wondering if there was a more concise definition that would work. My system may support many LVMs and many shares/exports per LVM, manageability may get out of control. My constraints look like this: colocation c2 inf: ( fs1 fs2 ip1 ) lvm1 colocation c3 inf: fs1 exportfs1 colocation c4 inf: fs2 exportfs2 order NFS-order1 inf: lvm1 fs1 exportfs1 order NFS-order2 inf: lvm1 fs2 exportfs2 order NFS-order3 inf: exportfs lvm1:start order NFS-order4 inf: lvm1 ip1 The XML version is rsc_colocation id=c2 score=INFINITY resource_set id=c2-0 sequential=false resource_ref id=fs1/ resource_ref id=fs2/ resource_ref id=ip1/ /resource_set resource_set id=c2-1 resource_ref id=lvm1/ /resource_set /rsc_colocation rsc_colocation id=c3 score=INFINITY rsc=fs1 with-rsc=exportfs1/ rsc_colocation id=c4 score=INFINITY rsc=fs2 with-rsc=exportfs2/ rsc_order id=nfssrv_order score=INFINITY first=nfsserver then=root_exportfs/ rsc_order id=NFS-order3 score=INFINITY first=root_exportfs then=lvm1 then-action=start/ rsc_order id=NFS-order1 score=INFINITY resource_set id=NFS-order1-0 resource_ref id=lvm1/ resource_ref id=fs1/ resource_ref id=exportfs1/ /resource_set /rsc_order rsc_order id=NFS-order2 score=INFINITY resource_set id=NFS-order2-0 resource_ref id=lvm1/ resource_ref id=fs2/ resource_ref id=exportfs2/ /resource_set /rsc_order rsc_order id=NFS-order4 score=INFINITY first=lvm1 then=ip1/ Thanks, Diane Schaefer ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
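One way to make this more concise, along the lines of the groups mentioned in the reply, is to group each filesystem with its export so only one ordering and one colocation are needed overall. A sketch in crm syntax using the resource names from the thread, untested against this exact configuration:

  group share1 fs1 exportfs1
  group share2 fs2 exportfs2
  order nfs-order inf: lvm1 ( share1 share2 ip1 )
  colocation nfs-coloc inf: ( share1 share2 ip1 ) lvm1

Each group starts its members in sequence (filesystem, then its export) and keeps them together, while the parenthesised set lets the shares and the IP come up in parallel once lvm1 is active.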
Re: [Pacemaker] Announcing 1.1.12 - Final
- Original Message - I am pleased to report that 1.1.12 is finally done. This is a really great release and includes three key improvements: - ACLs are now on by default - pacemaker-remote now works for bare-metal nodes - Thanks to a new algorithm, the CIB is now two orders of magnitude faster. Great work Andrew, those CIB improvements are insane! Pacemaker is about to find its way into some deployments we never thought were possible. This means less CPU usage by the cluster itself and faster failover times. I will be building for Fedora shortly, those on other distros (and the impatient) can build their own rpm packages with the instructions below. 1. Clone the current sources: # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git # cd pacemaker 2. Install dependancies (if you haven't already) [Fedora] # sudo yum install -y yum-utils [ALL] # make rpm-dep 3. Build Pacemaker # make release 4. Copy and deploy as needed Some stats for this release: - Changesets: 795 - Diff: 195 files changed, 13772 insertions(+), 6176 deletions(-) Features added since Pacemaker-1.1.11 • Changes to the ACL schema to support nodes and unix groups • cib: Check ACLs prior to making the update instead of parsing the diff afterwards • cib: Default ACL support to on • cib: Enable the more efficient xml patchset format • cib: Implement zero-copy status update • cib: Send all r/w operations via the cluster connection and have all nodes process them • crmd: Set cluster-name property to corosync's cluster_name by default for corosync-2 • crm_mon: Display brief output if -b/--brief is supplied or 'b' is toggled • crm_report: Allow ssh alternatives to be used • crm_ticket: Support multiple modifications for a ticket in an atomic operation • extra: Add logrotate configuration file for /var/log/pacemaker.log • Fencing: Add the ability to call stonith_api_time() from stonith_admin • logging: daemons always get a log file, unless explicitly set to configured 'none' • logging: allows the user to specify a log level that is output to syslog • PE: Automatically re-unfence a node if the fencing device definition changes • pengine: cl#5174 - Allow resource sets and templates for location constraints • pengine: Support cib object tags • pengine: Support cluster-specific instance attributes based on rules • pengine: Support id-ref in nvpair with optional name • pengine: Support per-resource maintenance mode • pengine: Support site-specific instance attributes based on rules • tools: Allow crm_shadow to create older configuration versions • tools: Display pending state in crm_mon/crm_resource/crm_simulate if --pending/-j is supplied (cl#5178) • xml: Add the ability to have lightweight schema revisions • xml: Enable resource sets in location constraints for 1.2 schema • xml: Support resources that require unfencing You can get the full details at https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.12 ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Pacemaker unnecessarily (?) restarts a vm on active node when other node brought out of standby
- Original Message - From: Ian cl-3...@jusme.com To: Clusterlabs (pacemaker) mailing list pacemaker@oss.clusterlabs.org Sent: Monday, May 12, 2014 3:02:50 PM Subject: [Pacemaker] Pacemaker unnecessarily (?) restarts a vm on active node when other node brought out of standby Hi, First message here and pretty new to this, so apologies if this is the wrong place/approach for this question. I'm struggling to describe this problem so searching for previous answers is tricky. Hopefully someone can give me a pointer... does setting resource-stickiness help? http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#s-resource-options -- Vossel Brief summary: -- Situation is: dual node cluster (CentOS 6.5), running drbd+gfs2 to provide active/active filestore for a libvirt domain (vm). With both nodes up, all is fine, active/active filesystem available on both nodes, vm running on node 1 Place node 2 into standby, vm is unaffected. Good. Bring node 2 back online, pacemaker chooses to stop the vm and gfs on node 1 while it promotes drbd to master on node 2. Bad (not very HA!) Hopefully I've just got a constraint missing/wrong (or the whole structure!). I know there is a constraint linking the promotion of the drbd resource to the starting of the gfs2 filesystem, but I wouldn't expect this to trigger on node 1 in the above scenario as it's already promoted? Versions: - Linux sv07 2.6.32-431.11.2.el6.centos.plus.x86_64 #1 SMP Tue Mar 25 21:36:54 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux pacemaker-libs-1.1.10-14.el6_5.2.x86_64 pacemaker-cli-1.1.10-14.el6_5.2.x86_64 pacemaker-cluster-libs-1.1.10-14.el6_5.2.x86_64 pacemaker-1.1.10-14.el6_5.2.x86_64 Configuration (abridged): - (I can provide full configs/logs if it isn't obvious and anyone cares to dig deeper) res_drbd_vm1 is the drbd resource vm_storage_core_dev is a group containing the drbd resources (just res_drbd_vm1 in this config) vm_storage_core_dev-master is the master/slave resource for drbd res_fs_vm1 is the gfs2 filesystem resource vm_storage_core is a group containing the gfs2 resources (just res_fs_vm1 in this config) vm_storage_core-clone is the clone resource to get the gfs2 filesystem active on both nodes res_vm_nfs_server is the libvirt domain (vm) (NB, The nfs filestore this server is sharing isn't from the gfs2 filesystem, but another drbd volume that is always active/passive) # pcs resource show --full Master: vm_storage_core_dev-master Meta Attrs: master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 notify=true Group: vm_storage_core_dev Resource: res_drbd_vm1 (class=ocf provider=linbit type=drbd) Attributes: drbd_resource=vm1 Operations: monitor interval=60s (res_drbd_vm1-monitor-interval-60s) Clone: vm_storage_core-clone Group: vm_storage_core Resource: res_fs_vm1 (class=ocf provider=heartbeat type=Filesystem) Attributes: device=/dev/drbd/by-res/vm1 directory=/data/vm1 fstype=gfs2 options=noatime,nodiratime Operations: monitor interval=60s (res_fs_vm1-monitor-interval-60s) Resource: res_vm_nfs_server (class=ocf provider=heartbeat type=VirtualDomain) Attributes: config=/etc/libvirt/qemu/vm09.xml Operations: monitor interval=60s (res_vm_nfs_server-monitor-interval-60s) # pcs constraint show Ordering Constraints: promote vm_storage_core_dev-master then start vm_storage_core-clone start vm_storage_core-clone then start res_vm_nfs_server Colocation Constraints: vm_storage_core-clone with vm_storage_core_dev-master (rsc-role:Started) (with-rsc-role:Master) res_vm_nfs_server with 
vm_storage_core-clone /var/log/messages on node 2 around the event: - May 12 19:23:02 sv07 attrd[3156]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-res_drbd_vm1 (1) May 12 19:23:02 sv07 attrd[3156]: notice: attrd_perform_update: Sent update 1067: master-res_drbd_vm1=1 May 12 19:23:02 sv07 crmd[3158]: notice: do_state_transition: State transition S_IDLE - S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ] May 12 19:23:02 sv07 attrd[3156]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-res_drbd_live (1) May 12 19:23:02 sv07 attrd[3156]: notice: attrd_perform_update: Sent update 1070: master-res_drbd_live=1 May 12 19:23:02 sv07 pengine[3157]: notice: LogActions: Promote res_drbd_vm1:0#011(Slave - Master sv07) May 12 19:23:02 sv07 pengine[3157]: notice: LogActions: Restart res_fs_vm1:0#011(Started sv06) May 12 19:23:02 sv07 pengine[3157]: notice: LogActions: Start res_fs_vm1:1#011(sv07) May 12 19:23:02 sv07 pengine[3157]: notice: LogActions: Restart res_vm_nfs_server#011(Started sv06) May 12 19:23:02 sv07
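A sketch of the stickiness suggestion in pcs syntax, plus a second, hedged guess: if the restart is being driven by the clone ordering (promote vm_storage_core_dev-master, then start vm_storage_core-clone), interleaving the clones is the usual way to stop a new instance on the returning node from restarting dependents on the node that never moved. The interleave part is an assumption about this configuration, not something confirmed in the thread:

  pcs resource defaults resource-stickiness=100
  # assumption: interleave so each node's clone instances are ordered independently
  pcs resource meta vm_storage_core-clone interleave=true
  pcs resource meta vm_storage_core_dev-master interleave=true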
Re: [Pacemaker] pacemaker unmanage resource
- Original Message - From: emmanuel segura emi2f...@gmail.com To: The Pacemaker cluster resource manager Pacemaker@oss.clusterlabs.org Sent: Friday, May 9, 2014 12:44:10 PM Subject: Re: [Pacemaker] pacemaker unmanage resource I found that the monitor operation remains active if the resource is in the unmanaged state. My stupid question: what is the difference between monitor and unmanaged? Sorry, I looked around and couldn't find any documentation about this. monitor = checking to see if a resource is active or not. unmanaged resource = a resource pacemaker will only perform status checks (monitors) on. These resources will not be stopped/started by pacemaker. -- Vossel 2014-05-09 17:13 GMT+02:00 emmanuel segura emi2f...@gmail.com : Hello List, I would like to know if it's normal that pacemaker does the monitor action while the resource is in the unmanaged state. yes Thanks -- esta es mi vida e me la vivo hasta que dios quiera ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
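A short illustration of the distinction in practice (crm syntax; the resource name is illustrative):

  crm resource unmanage my-rsc                   # no starts/stops, but recurring monitors keep running
  crm configure property maintenance-mode=true   # stops management and cancels recurring monitors cluster-wide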
Re: [Pacemaker] Pacemaker 1.1.12 - Release Candidate 1
- Original Message - From: Andrew Beekhof and...@beekhof.net To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Wednesday, May 7, 2014 1:31:27 AM Subject: [Pacemaker] Pacemaker 1.1.12 - Release Candidate 1 As promised, this announcement brings the first release candidate for Pacemaker 1.1.12 https://github.com/ClusterLabs/pacemaker/releases/Pacemaker-1.1.12-rc1 This release primarily focuses on important but mostly invisible changes under-the-hood: - The CIB is now O(2) faster. Thats 100x for those not familiar with Big-O notation :-) This has massively reduced the cluster's use of system resources, allowing us to scale further on the same hardware, and dramatically reduced failover times for large clusters. - Support for ACLs are is enabled by default. The new implementation can restrict cluster access for containers where pacemaker-remoted is used and is also more efficient. - All CIB updates are now serialized and pre-synchronized via the corosync CPG interface. This makes it impossible for updates to be lost, even when the cluster is electing a new DC. - Schema versioning changes New features are no longer silently added to the schema. Instead the ${Y} in pacemaker-${X}-${Y} will be incremented for simple additions, and ${X} will be bumped for removals or other changes requiring an XSL transformation. To take advantage of new features, you will need to updates all the nodes and then run the equivalent of `cibadmin --upgrade`. Thankyou to everyone that has tested out the new CIB and ACL code already. Please keep those bug reports coming in! Also, This release introduces permanent remote-node attributes. That feature was the last thing that functionally kept remote-nodes (nodes running pacemaker_remote) from behaving just like cluster-nodes. With these new CIB improvements pacemaker scales incredibly well. Couple those CIB changes with pacemaker's ability to manage remote-nodes and we now have the ability to scale clusters spanning hundreds possibly thousands of nodes. Exciting stuff. Thanks for everyone's hard work. This community is great! It's hard to believe how far pacemaker has come over the past few years. -- Vossel List of known bugs to be investigating during the RC phase: - 5206Fileencoding broken - 5194A resource starts with a standby node. (Latest attrd does not serve as the crmd-transition-delay parameter) - 5197Fail-over is delayed. (State transition is not calculated.) - 5139Each node fenced in its own transition during start-up fencing - 5200target node is over-utilized with allow-migrate=true - 5184Pending probe left in the cib - 5187-INFINITY colocation constraint not fully respected - 5165Add support for transient node utilization attributes To build `rpm` packages for testing: 1. Clone the current sources: # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git # cd pacemaker 1. Install dependancies (if you haven't already) [Fedora] # sudo yum install -y yum-utils [ALL] # make rpm-dep 1. Build Pacemaker # make rc 1. 
Copy and deploy as needed ## Details Changesets: 633 Diff: 184 files changed, 12690 insertions(+), 5843 deletions(-) ## Highlights ### Features added since Pacemaker-1.1.11 + Changes to the ACL schema to support nodes and unix groups + cib: Check ACLs prior to making the update instead of parsing the diff afterwards + cib: Default ACL support to on + cib: Enable the more efficient xml patchset format + cib: Implement zero-copy status update (performance) + cib: Send all r/w operations via the cluster connection and have all nodes process them + crm_mon: Display brief output if -b/--brief is supplied or 'b' is toggled + crm_ticket: Support multiple modifications for a ticket in an atomic operation + Fencing: Add the ability to call stonith_api_time() from stonith_admin + logging: daemons always get a log file, unless explicitly set to configured 'none' + PE: Automatically re-unfence a node if the fencing device definition changes + pengine: cl#5174 - Allow resource sets and templates for location constraints + pengine: Support cib object tags + pengine: Support cluster-specific instance attributes based on rules + pengine: Support id-ref in nvpair with optional name + pengine: Support per-resource maintenance mode + pengine: Support site-specific instance attributes based on rules + tools: Display pending state in crm_mon/crm_resource/crm_simulate if --pending/-j is supplied (cl#5178) + xml: Add the ability to have lightweight schema revisions + xml: Enable resource sets in location constraints for 1.2 schema + xml: Support resources that require unfencing See
Re: [Pacemaker] Feedback when crm appears to do nothing
- Original Message - From: Iain Buchanan iain...@gmail.com To: Pacemaker pacemaker@oss.clusterlabs.org Sent: Wednesday, April 23, 2014 2:59:48 AM Subject: [Pacemaker] Feedback when crm appears to do nothing Hi, I hope this is the right list for corosync/pacemaker questions - apologies if it is not. I'm running pacemaker 1.1.10 and corosync 2.3.0 under Ubuntu 12.04. Occasionally I send a command using crm such as crm resource move RESOURCE SERVER and absolutely nothing appears to happen. When this occurs is there a way of seeing why - was there a constraint that prevented the move etc.? There doesn't seem to be anything at the info level in the log. are you by any chance managing Upstart resources? If so you need to update to 1.1.11. There was a bug in 1.1.10 that caused the crmd to block indefinitely when upstart resources existed in the cib. It had to do with how the crmd communicated with the upstart daemon over dbus. -- Vossel Iain ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
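Beyond the Upstart issue above, two generic ways to see why a move apparently did nothing; the cli-prefer id prefix is what recent crm_resource versions use, so treat it as an assumption for other versions:

  # show placement scores and the actions the policy engine would take, against the live CIB
  crm_simulate -sL
  # 'crm resource move' works by injecting a location constraint; check that it actually appeared
  cibadmin -Q | grep cli-prefer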
Re: [Pacemaker] Best practice for quorum nodes
- Original Message - From: Andrew Martin amar...@xes-inc.com To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Friday, April 18, 2014 9:38:45 AM Subject: [Pacemaker] Best practice for quorum nodes Hello, I've read several guides about how to configure a 3-node cluster with one node that can't actually run the resources, but just serves as a quorum node. One practice for configuring this node is to put it in standby, which prevents it from running resources. In my experience, this seems to work pretty well, however from time-to-time I see these errors appear in my pacemaker logs: Preventing rsc from re-starting on host: operation monitor failed 'not installed' (rc=5) Is there a better way to designate a node as a quorum node, so that resources do not attempt to start or re-start on it? Perhaps a combination of setting it in standby mode and a resource constraint to prevent the resources from running on it? Or, is there a better way to set it up? Pacemaker is going to probe the node for active resources regardless. If you don't want to see these errors, make sure the resource-agents and resource-agent dependent packages are installed on all nodes in the cluster. -- Vossel Thanks, Andrew ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
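If installing the agents everywhere is not an option, the combination the poster describes looks roughly like this in pcs syntax (node and resource names are illustrative):

  pcs cluster standby quorumnode
  pcs constraint location MyResource avoids quorumnode

Standby keeps resources from running there, but probes still happen, so the 'not installed' messages only disappear once the agents are installed or, in newer releases, the resource-discovery constraint option covered elsewhere in this digest is used to suppress probing.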
Re: [Pacemaker] RHEL/centos6 - pacemaker - checking value of PCMK_ipc_buffer
- Original Message - From: Nikola Ciprich nikola.cipr...@linuxbox.cz To: pacema...@clusterlabs.org Sent: Friday, April 18, 2014 12:47:35 AM Subject: [Pacemaker] RHEL/centos6 - pacemaker - checking value of PCMK_ipc_buffer Hello, I've hit internal limit of PCMK_ipc_buffer on one of my cluster. Unfortunately I'm using corosync + plugin configuration (which I know is discouraged, so I'll switch this production cluster to CMAN ASAP), however, I tried setting PCMK_ipc_buffer on my test cluster already running on CMAN + pacemaker by setting /etc/sysconfig/pacemaker and checking environ values of cib and other processes /proc/???/environ files and don't see variables set there.. Therefore my question is, can I somehow check the limits I set are really applied? So I can be sure I've set it correctly? Thanks a lot in advance for reply what version of libqb and pacemaker are you using? best regards nik -- - Ing. Nikola CIPRICH LinuxBox.cz, s.r.o. 28.rijna 168, 709 00 Ostrava tel.: +420 591 166 214 fax:+420 596 621 273 mobil: +420 777 093 799 www.linuxbox.cz mobil servis: +420 737 238 656 email servis: ser...@linuxbox.cz - ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
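A sketch of the check the poster describes, made explicit; it assumes the daemon is running and that its process name is simply 'cib':

  # show the environment the running cib process actually received
  tr '\0' '\n' < /proc/$(pidof cib)/environ | grep PCMK

If PCMK_ipc_buffer does not show up there, /etc/sysconfig/pacemaker was not sourced by whatever started pacemaker, which would explain why the limit is still being hit.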
Re: [Pacemaker] crmd does abort if a stopped node is specified
- Original Message - From: Kazunori INOUE kazunori.ino...@gmail.com To: pm pacemaker@oss.clusterlabs.org Sent: Friday, April 18, 2014 4:49:42 AM Subject: [Pacemaker] crmd does abort if a stopped node is specified Hi, crmd does abort if I load CIB which specified a stopped node. # crm_mon -1 Last updated: Fri Apr 18 11:51:36 2014 Last change: Fri Apr 18 11:51:30 2014 Stack: corosync Current DC: pm103 (3232261519) - partition WITHOUT quorum Version: 1.1.11-cf82673 1 Nodes configured 0 Resources configured Online: [ pm103 ] # cat test.cli node pm103 node pm104 # crm configure load update test.cli Apr 18 11:52:42 pm103 crmd[11672]:error: crm_int_helper: Characters left over after parsing 'pm104': 'pm104' Apr 18 11:52:42 pm103 crmd[11672]:error: crm_abort: crm_get_peer: Triggered fatal assert at membership.c:420 : id 0 || uname != NULL Apr 18 11:52:42 pm103 pacemakerd[11663]:error: child_waitpid: Managed process 11672 (crmd) dumped core (gdb) bt #0 0x0033da432925 in raise () from /lib64/libc.so.6 #1 0x0033da434105 in abort () from /lib64/libc.so.6 #2 0x7f30241b7027 in crm_abort (file=0x7f302440b0b3 membership.c, function=0x7f302440b5d0 crm_get_peer, line=420, assert_condition=0x7f302440b27e id 0 || uname != NULL, do_core=1, do_fork=0) at utils.c:1177 #3 0x7f30244048ee in crm_get_peer (id=0, uname=0x0) at membership.c:420 #4 0x7f3024402238 in crm_peer_uname (uuid=0x113e7c0 pm104) at is the uuid for your cluster nodes supposed to be the same as the uname? We're treating the uuid in this situation as if it should be a number, which it clearly is not. -- Vossel cluster.c:386 #5 0x0043afbd in abort_transition_graph (abort_priority=100, abort_action=tg_restart, abort_text=0x44d2f4 Non-status change, reason=0x113e4b0, fn=0x44df07 te_update_diff, line=382) at te_utils.c:518 #6 0x0043caa4 in te_update_diff (event=0x10f2240 cib_diff_notify, msg=0x1137660) at te_callbacks.c:382 #7 0x7f302461d1bc in cib_native_notify (data=0x10ef750, user_data=0x1137660) at cib_utils.c:733 #8 0x0033db83d6bc in g_list_foreach () from /lib64/libglib-2.0.so.0 #9 0x7f3024620191 in cib_native_dispatch_internal (buffer=0xe61ea8 notify t=\cib_notify\ subt=\cib_diff_notify\ cib_op=\cib_apply_diff\ cib_rc=\0\ cib_object_type=\diff\cib_generationgeneration_tuple epoch=\4\ num_updates=\0\ admin_epoch=\0\ validate-with=\pacem..., length=1708, userdata=0xe5eb90) at cib_native.c:123 #10 0x7f30241dee72 in mainloop_gio_callback (gio=0xf61ea0, condition=G_IO_IN, data=0xe601b0) at mainloop.c:639 #11 0x0033db83feb2 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0 #12 0x0033db843d68 in ?? () from /lib64/libglib-2.0.so.0 #13 0x0033db844275 in g_main_loop_run () from /lib64/libglib-2.0.so.0 #14 0x00406469 in crmd_init () at main.c:154 #15 0x004062b0 in main (argc=1, argv=0x7fff908829f8) at main.c:121 Is this all right? Best Regards, Kazunori INOUE ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] pcs resource create with one script
- Original Message - From: Dvorak Andreas andreas.dvo...@baaderbank.de To: pacemaker@oss.clusterlabs.org Sent: Wednesday, April 16, 2014 12:36:14 PM Subject: [Pacemaker] pcs resource create with one script Dear all, I want to create a resource with my own script, but it does not work. pcs resource create MYSQLFS ocf:baader:MYSQLFS op monitor interval=30s Error: unable to locate command: /usr/lib/ocf/resource.d/baader/MYSQLFS Does the ocf script implement the meta-data function? I think pcs will throw errors if it can't retrieve the metadata. You could try the --force option. -- Vossel But the script does exist. ls -l /usr/lib/ocf/resource.d/baader/MYSQLFS -rwxr-xr-x 1 root root 2548 Apr 16 19:16 /usr/lib/ocf/resource.d/baader/MYSQLFS Can somebody please explain how to solve this problem? pcs status Cluster name: mysql-int-prod Last updated: Wed Apr 16 19:27:07 2014 Last change: Fri Dec 13 15:54:04 2013 via crmd on sv2828-p1 Stack: cman Current DC: sv2827-p1 - partition with quorum Version: 1.1.10-1.el6_4.4-368c726 4 Nodes configured 2 Resources configured Online: [ sv2827-p1 sv2828-p1 ] OFFLINE: [ sv2827 sv2828 ] Full list of resources: ipmi-fencing-sv2827 (stonith:fence_ipmilan): Started sv2827-p1 ipmi-fencing-sv2828 (stonith:fence_ipmilan): Started sv2828-p1 Best regards Andreas Dvorak ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
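A hedged way to see what pcs is likely tripping over is to exercise the agent's OCF actions by hand (the path is the one from the thread):

# the meta-data action has to succeed for pcs to accept the agent
OCF_ROOT=/usr/lib/ocf /usr/lib/ocf/resource.d/baader/MYSQLFS meta-data
# if ocf-tester is installed, it checks the rest of the OCF contract as well
ocf-tester -n MYSQLFS /usr/lib/ocf/resource.d/baader/MYSQLFS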
Re: [Pacemaker] redhat 7 pacemaker compiled without acl
- Original Message - From: emmanuel segura emi2f...@gmail.com To: The Pacemaker cluster resource manager Pacemaker@oss.clusterlabs.org Sent: Friday, April 4, 2014 1:12:26 PM Subject: [Pacemaker] redhat 7 pacemaker compiled without acl Hello List, I'm trying to install a virtual cluster using redhat 7 beta, and the first thing I noticed is this. ::: [root@localhost ~]# cibadmin -! Pacemaker 1.1.10-19.el7 (Build: 368c726): generated-manpages agent-manpages ascii-docs publican-docs ncurses libqb-logging libqb-ipc upstart systemd nagios corosync-native ::: Was Pacemaker compiled without the acl option? If the answer is yes, why? Doesn't redhat support pacemaker acl? There is a new pacemaker acl implementation underway right now upstream. The current pacemaker acl support was held out of the rhel7 build with the expectation of picking up the new implementation in a later rhel release. -- Vossel Thanks ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] pcs and lsb resource
- Original Message - From: Dori Seliskar d...@delo.si To: pacemaker@oss.clusterlabs.org Sent: Friday, March 21, 2014 7:46:37 AM Subject: [Pacemaker] pcs and lsb resource Hi all, I'm trying to create an lsb resource with pcs on fedora 20 and I'm failing miserably every time (I have successfully created ocf resources though): # pcs resource create impulz_dosemu lsb:impulz op stop interval=0 timeout=60s monitor interval=60s timeout=5s start interval=0 timeout=60s Error: Unable to create resource 'lsb:impulz', it is not installed on this system (use --force to override) This could be a pcs bug. If using --force works, this is definitely a pcs bug. pcs is supposed to be looking through /etc/init.d/* on the local machine for valid lsb scripts before applying a lsb resource into the cluster config. Maybe it isn't matching right in the f20 pcs package. -- Vossel Manually starting/stopping the resource from /etc/init.d works just fine and I tested the script for compatibility against http://www.linux-ha.org/wiki/LSB_Resource_Agents What am I missing here? Thanks! Best regards, dori ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
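A hedged sanity check along the lines of Vossel's reply, assuming pcs really does scan /etc/init.d on the local node:

# the init script must exist, be executable, and answer the status action
ls -l /etc/init.d/impulz
/etc/init.d/impulz status; echo "status rc=$?"
# the error message itself offers --force as an override for the existence check
pcs resource create impulz_dosemu lsb:impulz op monitor interval=60s timeout=5s --force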
Re: [Pacemaker] crmd was aborted at pacemaker 1.1.11
- Original Message - From: Kazunori INOUE kazunori.ino...@gmail.com To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Tuesday, March 18, 2014 12:30:01 AM Subject: Re: [Pacemaker] crmd was aborted at pacemaker 1.1.11 2014-03-18 8:03 GMT+09:00 David Vossel dvos...@redhat.com: - Original Message - From: Kazunori INOUE kazunori.ino...@gmail.com To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Monday, March 17, 2014 4:51:11 AM Subject: Re: [Pacemaker] crmd was aborted at pacemaker 1.1.11 2014-03-17 16:37 GMT+09:00 Kazunori INOUE kazunori.ino...@gmail.com: 2014-03-15 4:08 GMT+09:00 David Vossel dvos...@redhat.com: - Original Message - From: Kazunori INOUE kazunori.ino...@gmail.com To: pm pacemaker@oss.clusterlabs.org Sent: Friday, March 14, 2014 5:52:38 AM Subject: [Pacemaker] crmd was aborted at pacemaker 1.1.11 Hi, When specifying the node name in UPPER case and performing crm_resource, crmd was aborted. (The real node name is a LOWER case.) https://github.com/ClusterLabs/pacemaker/pull/462 does that fix it? Since behavior of glib is strange somehow, the result is NO. I tested this brunch. https://github.com/davidvossel/pacemaker/tree/lrm-segfault * Red Hat Enterprise Linux Server release 6.4 (Santiago) * glib2-2.22.5-7.el6.x86_64 strcase_equal() is not called from g_hash_table_lookup(). [x3650h ~]$ gdb /usr/libexec/pacemaker/crmd 17409 ...snip... (gdb) b lrm.c:1232 Breakpoint 1 at 0x4251d0: file lrm.c, line 1232. (gdb) b strcase_equal Breakpoint 2 at 0x429828: file lrm_state.c, line 95. (gdb) c Continuing. Breakpoint 1, do_lrm_invoke (action=288230376151711744, cause=C_IPC_MESSAGE, cur_state=S_NOT_DC, current_input=I_ROUTER, msg_data=0x7fff8d679540) at lrm.c:1232 1232lrm_state = lrm_state_find(target_node); (gdb) s lrm_state_find (node_name=0x1d4c650 X3650H) at lrm_state.c:267 267 { (gdb) n 268 if (!node_name) { (gdb) n 271 return g_hash_table_lookup(lrm_state_table, node_name); (gdb) p g_hash_table_size(lrm_state_table) $1 = 1 (gdb) p (char*)((GList*)g_hash_table_get_keys(lrm_state_table))-data $2 = 0x1c791a0 x3650h (gdb) p node_name $3 = 0x1d4c650 X3650H (gdb) n 272 } (gdb) n do_lrm_invoke (action=288230376151711744, cause=C_IPC_MESSAGE, cur_state=S_NOT_DC, current_input=I_ROUTER, msg_data=0x7fff8d679540) at lrm.c:1234 1234if (lrm_state == NULL is_remote_node) { (gdb) n 1240CRM_ASSERT(lrm_state != NULL); (gdb) n Program received signal SIGABRT, Aborted. 0x003787e328a5 in raise () from /lib64/libc.so.6 (gdb) I wonder why... so I will continue investigation. I read the code of g_hash_table_lookup(). Key is compared by the hash value generated by crm_str_hash before strcase_equal() is performed. good catch. I've updated the patch in this pull request. Can you give it a go? https://github.com/ClusterLabs/pacemaker/pull/462 fail-count is not cleared only in this. $ crm_resource -C -r p1 -N X3650H Cleaning up p1 on X3650H Waiting for 1 replies from the CRMd. 
OK $ grep fail-count /var/log/ha-log Mar 18 13:53:36 x3650g attrd[3610]:debug: attrd_client_message: Broadcasting fail-count-p1[X3650H] = (null) $ $ crm_mon -rf1 Last updated: Tue Mar 18 13:54:51 2014 Last change: Tue Mar 18 13:53:36 2014 by hacluster via crmd on x3650h Stack: corosync Current DC: x3650h (3232261384) - partition with quorum Version: 1.1.10-83553fa 2 Nodes configured 1 Resources configured Online: [ x3650g x3650h ] Full list of resources: p1 (ocf::pacemaker:Dummy): Stopped Migration summary: * Node x3650h: p1: migration-threshold=1 fail-count=1 last-failure='Tue Mar 18 13:53:19 2014' * Node x3650g: $ So this change also seems to be necessary. yep, added your patch to the pull request https://github.com/davidvossel/pacemaker/commit/c118ac5b5244890c19e4c7b2f5a39208d362b61d I found another one in stonith that I fixed. https://github.com/ClusterLabs/pacemaker/pull/462 Are we good for merging this now? -- Vossel ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
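Until the case-handling fixes above are merged, a hedged workaround is to pass the node name exactly as the cluster reports it (lowercase here) and then confirm the attribute is really gone:

crm_resource -C -r p1 -N x3650h
# the Migration summary should no longer list a fail-count for p1
crm_mon -rf1 | grep -A3 'Migration summary'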
Re: [Pacemaker] crmd was aborted at pacemaker 1.1.11
- Original Message - From: Kazunori INOUE kazunori.ino...@gmail.com To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Monday, March 17, 2014 4:51:11 AM Subject: Re: [Pacemaker] crmd was aborted at pacemaker 1.1.11 2014-03-17 16:37 GMT+09:00 Kazunori INOUE kazunori.ino...@gmail.com: 2014-03-15 4:08 GMT+09:00 David Vossel dvos...@redhat.com: - Original Message - From: Kazunori INOUE kazunori.ino...@gmail.com To: pm pacemaker@oss.clusterlabs.org Sent: Friday, March 14, 2014 5:52:38 AM Subject: [Pacemaker] crmd was aborted at pacemaker 1.1.11 Hi, When specifying the node name in UPPER case and performing crm_resource, crmd was aborted. (The real node name is a LOWER case.) https://github.com/ClusterLabs/pacemaker/pull/462 does that fix it? Since behavior of glib is strange somehow, the result is NO. I tested this brunch. https://github.com/davidvossel/pacemaker/tree/lrm-segfault * Red Hat Enterprise Linux Server release 6.4 (Santiago) * glib2-2.22.5-7.el6.x86_64 strcase_equal() is not called from g_hash_table_lookup(). [x3650h ~]$ gdb /usr/libexec/pacemaker/crmd 17409 ...snip... (gdb) b lrm.c:1232 Breakpoint 1 at 0x4251d0: file lrm.c, line 1232. (gdb) b strcase_equal Breakpoint 2 at 0x429828: file lrm_state.c, line 95. (gdb) c Continuing. Breakpoint 1, do_lrm_invoke (action=288230376151711744, cause=C_IPC_MESSAGE, cur_state=S_NOT_DC, current_input=I_ROUTER, msg_data=0x7fff8d679540) at lrm.c:1232 1232lrm_state = lrm_state_find(target_node); (gdb) s lrm_state_find (node_name=0x1d4c650 X3650H) at lrm_state.c:267 267 { (gdb) n 268 if (!node_name) { (gdb) n 271 return g_hash_table_lookup(lrm_state_table, node_name); (gdb) p g_hash_table_size(lrm_state_table) $1 = 1 (gdb) p (char*)((GList*)g_hash_table_get_keys(lrm_state_table))-data $2 = 0x1c791a0 x3650h (gdb) p node_name $3 = 0x1d4c650 X3650H (gdb) n 272 } (gdb) n do_lrm_invoke (action=288230376151711744, cause=C_IPC_MESSAGE, cur_state=S_NOT_DC, current_input=I_ROUTER, msg_data=0x7fff8d679540) at lrm.c:1234 1234if (lrm_state == NULL is_remote_node) { (gdb) n 1240CRM_ASSERT(lrm_state != NULL); (gdb) n Program received signal SIGABRT, Aborted. 0x003787e328a5 in raise () from /lib64/libc.so.6 (gdb) I wonder why... so I will continue investigation. I read the code of g_hash_table_lookup(). Key is compared by the hash value generated by crm_str_hash before strcase_equal() is performed. good catch. I've updated the patch in this pull request. Can you give it a go? https://github.com/ClusterLabs/pacemaker/pull/462 *** This is quick-fix solution. 
***

 crmd/lrm_state.c   |  4 ++--
 include/crm/crm.h  |  2 ++
 lib/common/utils.c | 11 +++
 3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/crmd/lrm_state.c b/crmd/lrm_state.c
index d20d74a..ae036fd 100644
--- a/crmd/lrm_state.c
+++ b/crmd/lrm_state.c
@@ -234,13 +234,13 @@ lrm_state_init_local(void)
     }

     lrm_state_table =
-        g_hash_table_new_full(crm_str_hash, strcase_equal, NULL, internal_lrm_state_destroy);
+        g_hash_table_new_full(crm_str_hash2, strcase_equal, NULL, internal_lrm_state_destroy);
     if (!lrm_state_table) {
         return FALSE;
     }

     proxy_table =
-        g_hash_table_new_full(crm_str_hash, strcase_equal, NULL, remote_proxy_free);
+        g_hash_table_new_full(crm_str_hash2, strcase_equal, NULL, remote_proxy_free);
     if (!proxy_table) {
         g_hash_table_destroy(lrm_state_table);
         return FALSE;

diff --git a/include/crm/crm.h b/include/crm/crm.h
index b763cc0..46fe5df 100644
--- a/include/crm/crm.h
+++ b/include/crm/crm.h
@@ -195,7 +195,9 @@ typedef GList *GListPtr;
 #  include <crm/error.h>

 #  define crm_str_hash g_str_hash_traditional
+#  define crm_str_hash2 g_str_hash_traditional2
 guint g_str_hash_traditional(gconstpointer v);
+guint g_str_hash_traditional2(gconstpointer v);

 #endif

diff --git a/lib/common/utils.c b/lib/common/utils.c
index 29d7965..50fa6c0 100644
--- a/lib/common/utils.c
+++ b/lib/common/utils.c
@@ -2368,6 +2368,17 @@ g_str_hash_traditional(gconstpointer v)
     return h;
 }

+guint
+g_str_hash_traditional2(gconstpointer v)
+{
+    const signed char *p;
+    guint32 h = 0;
+
+    for (p = v; *p != '\0'; p++)
+        h = (h << 5) - h + g_ascii_tolower(*p);
+
+    return h;
+}
 void *
 find_library_function(void **handle, const char *lib, const char *fn, gboolean fatal)

# crm_resource -C -r p1 -N X3650H
Cleaning up p1 on X3650H
Waiting for 1 replies from the CRMd. No messages received in 60 seconds.. aborting

Mar 14 18:33:10 x3650h crmd[10718]: error: crm_abort: do_lrm_invoke: Triggered fatal assert at lrm.c:1240 : lrm_state != NULL
...snip...
Mar 14 18:33:10
Re: [Pacemaker] crmd was aborted at pacemaker 1.1.11
- Original Message - From: Kazunori INOUE kazunori.ino...@gmail.com To: pm pacemaker@oss.clusterlabs.org Sent: Friday, March 14, 2014 5:52:38 AM Subject: [Pacemaker] crmd was aborted at pacemaker 1.1.11 Hi, When specifying the node name in UPPER case and performing crm_resource, crmd was aborted. (The real node name is a LOWER case.) https://github.com/ClusterLabs/pacemaker/pull/462 does that fix it? # crm_resource -C -r p1 -N X3650H Cleaning up p1 on X3650H Waiting for 1 replies from the CRMdNo messages received in 60 seconds.. aborting Mar 14 18:33:10 x3650h crmd[10718]:error: crm_abort: do_lrm_invoke: Triggered fatal assert at lrm.c:1240 : lrm_state != NULL ...snip... Mar 14 18:33:10 x3650h pacemakerd[10708]:error: child_waitpid: Managed process 10718 (crmd) dumped core * The state before performing crm_resource. Stack: corosync Current DC: x3650g (3232261383) - partition with quorum Version: 1.1.10-38c5972 2 Nodes configured 3 Resources configured Online: [ x3650g x3650h ] Full list of resources: f-g (stonith:external/ibmrsa-telnet): Started x3650h f-h (stonith:external/ibmrsa-telnet): Started x3650g p1 (ocf::pacemaker:Dummy): Stopped Migration summary: * Node x3650g: * Node x3650h: p1: migration-threshold=1 fail-count=1 last-failure='Fri Mar 14 18:32:48 2014' Failed actions: p1_monitor_1 on x3650h 'not running' (7): call=16, status=complete, last-rc-change='Fri Mar 14 18:32:48 2014', queued=0ms, exec=0ms Just for reference, similar phenomenon did not occur by crm_standby. $ crm_standby -U X3650H -v on Best Regards, Kazunori INOUE ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] drbd + lvm
- Original Message - From: Infoomatic infooma...@gmx.at To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Thursday, March 13, 2014 5:28:19 PM Subject: Re: [Pacemaker] drbd + lvm Has anyone had this issue and resolved it? Any ideas? Thanks in advance! Yep, i've hit this as well. Use the latest LVM agent. I already fixed all of this. https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/LVM Keep your volume_list the way it is and use the 'exclusive=true' LVM option. This will allow the LVM agent to activate volumes that don't exist in the volume_list. Hope that helps Thanks for the fast response. I upgraded LVM to the backports (2.02.95-4ubuntu1.1~precise1) and used this script, but I am getting errors when one of the nodes tries to activate the VG. The log: Mar 13 23:21:03 lxc02 LVM[7235]: INFO: 0 logical volume(s) in volume group replicated now active Mar 13 23:21:03 lxc02 LVM[7235]: INFO: LVM Volume replicated is not available (stopped) exclusive is true and the tag is pacemaker. Someone got hints? tia! Yeah, those aren't errors. It's just telling you that the LVM agent stopped successfully. I would expect to see these after you did a failover or resource recovery. Is the resource not starting and stopping correctly for you? If not, I'll need more logs. -- Vossel infoomatic ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] drbd + lvm
- Original Message - From: Infoomatic infooma...@gmx.at To: pacemaker@oss.clusterlabs.org Sent: Thursday, March 13, 2014 2:26:00 PM Subject: [Pacemaker] drbd + lvm Hi list, I am having troubles with pacemaker and lvm and stacked drbd resources. The system consists of 2 Ubuntu 12 LTS servers, each having two partitions of an underlying raid 1+0 as volume group with one LV each as a drbd backing device. The purpose is for usage with VMs and adjusting needed disk space flexible, so on top of the drbd resources there are LVs for each VM. I created a stack with LCMC, which is like: DRBD-LV-libvirt and DRBD-LV-Filesystem-lxc The problem now: the system has hickups - when VM01 runs on HOST01 (being primary DRBD) and HOST02 is restarting, lvm is reloaded (at boot time) and the LVs are being activated. This of course results in an error, the log entry: Mar 13 17:58:42 host01 pengine: [27563]: ERROR: native_create_actions: Resource res_LVM_1 (ocf::LVM) is active on 2 nodes attempting recovery Therefore, as configured, the resource is stopped and started again (on only one node). Thus, all VMs and containers relying on this are also restared. When I disable the LVs that use the DRBD resource at boot (lvm.conf: volume_list only containing the VG from the partitions of the raidsystem) a reboot of the secondary does not restart the VMs running on the primary. However, if the primary goes down (e.g. power interruption), the secondary cannot activate the LVs of the VMs because they are not in the list of lvm.conf to be activated. Has anyone had this issue and resolved it? Any ideas? Thanks in advance! Yep, i've hit this as well. Use the latest LVM agent. I already fixed all of this. https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/LVM Keep your volume_list the way it is and use the 'exclusive=true' LVM option. This will allow the LVM agent to activate volumes that don't exist in the volume_list. Hope that helps -- Vossel infoomatic ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
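A hedged sketch of the setup Vossel describes, in crm shell syntax with the resource and volume group names taken from this thread:

# volume_list in lvm.conf stays as it is; with exclusive=true the updated agent
# uses tagging to activate a VG that is not listed there
primitive res_LVM_1 ocf:heartbeat:LVM \
    params volgrpname="replicated" exclusive="true" \
    op monitor interval="30s" timeout="30s"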
Re: [Pacemaker] process/service watcher
- Original Message - From: Yair Ogen (yaogen) yao...@cisco.com To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Thursday, March 13, 2014 9:22:44 AM Subject: Re: [Pacemaker] process/service watcher Thanks Frank, so you confirm that pacemaker doesn’t offer this? Yes, you can run a single node cluster. It sounds like it doesn't make any sense, but I've actually seen this used in ways I wouldn't have expected. It has valid use-cases. -- Vossel Yair From: Frank Brendel [mailto:frank.bren...@eurolog.com] Sent: Thursday, March 13, 2014 16:05 To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] process/service watcher Hi Yair, try monit http://mmonit.com/monit/ Regards Frank Am 13.03.2014 14:24, schrieb Yair Ogen (yaogen): Does pacemaker have an option to act as a process / service watcher regardless to being part of a cluster? i.e. watch a process and identify when it’s down and re-start it. I am looking for a software solution that does this even a non-clustered environment. Thanks. Regards, Yair ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] How to delay first monitor op upon resource start?
- Original Message - From: Gianluca Cecchi gianluca.cec...@gmail.com To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Thursday, March 13, 2014 12:00:16 PM Subject: [Pacemaker] How to delay first monitor op upon resource start? Hello, I have some init based scripts that I configure as lsb resources. They are java based (in this case ovirt-engine and ovirt-websocket-proxy from oVirt project) and they are started through the rhel daemon function. Basically it needs a few seconds before the scripts exit and the status option returns ok. So most of times when used as resources in pacemaker, their start is registered as FAILED because the status call happens too quickly. In the mean time I solved the problem putting a sleep 5 before the exit, but I would like to know if I can set a resource or cluster parameter so that the first status monitor after start is delayed. So I don't need to ask maintainer to make the change to the script and I don't need after every update to remember to re-modify the script. This is a problem with the LSB script. No scripts that pacemaker manages should ever return start until status passes. The status passing should be a condition for start passing. You should make a loop at the end of the start function that waits for status to pass before returning. with that said... there is a way to delay the monitor operation in pacemaker like you are wanting. This is a terrible idea, i don't recommend it, and i don't guarantee it won't get deprecated entirely someday. the option is called 'start-delay' and you set it within the monitor operation section (same place interval and timeout are set). Set that option to the amount of milliseconds you want to delay the operation execution. -- Vossel Another option would be to try the status after start more than once so that eventually the first time is not ok, but it is so the second one Thanks in advance, Gianluca ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
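A minimal sketch of the loop Vossel recommends, assuming a RHEL-style init script; the service name and the daemon invocation are placeholders, not the real ovirt-engine script:

. /etc/init.d/functions
prog=ovirt-engine

start() {
    daemon $prog        # placeholder: the real script launches the java service here
    # do not return from start until the status action actually passes
    for i in $(seq 1 60); do
        status $prog >/dev/null 2>&1 && return 0
        sleep 1
    done
    return 1
}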
Re: [Pacemaker] Don't want to stop lsb resource on migration
- Original Message - From: Bingham knee-jerk-react...@hotmail.com To: pacemaker@oss.clusterlabs.org Sent: Thursday, March 13, 2014 9:00:55 AM Subject: [Pacemaker] Don't want to stop lsb resource on migration Hello, My setup: I have a 2 node cluster using pacemaker and heartbeat. I have 2 resources, ocf::heartbeat:IPaddr and lsb:rabbitmq-server. I have these 2 resources grouped together and they will fail over to the other node. question: When rabbitmq is migrated to node1 from node2 I would like to 'not' have the the /etc/init.d/rabbitmq-server stop happen on the failed server (node1 in this example). Is it possible to do this in crm? I realize that I could hack the initscript's case statement for stop to just exit 0, but I am hoping there is a way to do this in crm. there isn't Thanks for any help, Steve ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Pacemaker/corosync freeze
- Original Message - From: Jan Friesse jfrie...@redhat.com To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Thursday, March 13, 2014 4:03:28 AM Subject: Re: [Pacemaker] Pacemaker/corosync freeze ... Also can you please try to set debug: on in corosync.conf and paste full corosync.log then? I set debug to on, and did a few restarts but could not reproduce the issue yet - will post the logs as soon as I manage to reproduce. Perfect. Another option you can try to set is netmtu (1200 is usually safe). Finally I was able to reproduce the issue. I restarted node ctsip2 at 21:10:14, and CPU went 100% immediately (not when node was up again). The corosync log with debug on is available at: http://pastebin.com/kTpDqqtm To be honest, I had to wait much longer for this reproduction as before, even though there was no change in the corosync configuration - just potentially some system updates. But anyway, the issue is unfortunately still there. Previously, when this issue came, cpu was at 100% on all nodes - this time only on ctmgr, which was the DC... I hope you can find some useful details in the log. Attila, what seems to be interesting is Configuration ERRORs found during PE processing. Please run crm_verify -L to identify issues. I'm unsure how much is this problem but I'm really not pacemaker expert. Anyway, I have theory what may happening and it looks like related with IPC (and probably not related to network). But to make sure we will not try fixing already fixed bug, can you please build: - New libqb (0.17.0). There are plenty of fixes in IPC - Corosync 2.3.3 (already plenty IPC fixes) yes, there was a libqb/corosync interoperation problem that showed these same symptoms last year. Updating to the latest corosync and libqb will likely resolve this. - And maybe also newer pacemaker I know you were not very happy using hand-compiled sources, but please give them at least a try. Thanks, Honza Thanks, Attila Regards, Honza There are also a few things that might or might not be related: 1) Whenever I want to edit the configuration with crm configure edit, ... ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
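For reference, the two corosync.conf changes suggested in this thread would look roughly like this; these are fragments to merge into the existing totem and logging sections, not a complete file:

totem {
    ...
    netmtu: 1200
}

logging {
    ...
    debug: on
}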
Re: [Pacemaker] Pacemaker remote and persistent remote node attributes
- Original Message - From: Покотиленко Костик cas...@meteor.dp.ua To: pacemaker@oss.clusterlabs.org Sent: Thursday, March 6, 2014 1:42:25 PM Subject: [Pacemaker] Pacemaker remote and persistent remote node attributes Hi, I'm new here. awesome, welcome :D I'm looking for ways of migrating to pacemaker of the current setup which is ~10 KVM hypervisors with ~20 VMs each. There are few VM classes each running it's set of services. Each service is run on =2 VMs on different HV for balancing. Failover and management is to be added. As far as I know there are 3 ways to manage this setup in pacemaker: 1. make cluster of VMs, use libvirt fencing, have problems 2. make cluster of VMs, make different cluster of HVs, propagate fencing from VM cluster to HV cluster to do real fencing of VMs and HVs. Not sure how to do this. I've found a solution for XEN from RedHat, but not for KVM 3. use pacemaker with pacemaker-remote to manage VMs and services on both HVs and VMs, have fun maybe I'm biased, but I like option 3 Tell me what I missed. I research pacemaker-remote option now. It seems to be best, but it doesn't support remote node persistent attributes for now (1.1.11). With node attributes I can place services by node class and location rules in this case are very simple. So the question is: are the remote node persistent attributes going to be implemented yes, I plan on doing this for 1.1.12 or what are the workarounds? I'm not aware of a good one :/ P.S. Some other questions: - what are the reliable versions of pacemaker and corosync? Now I use pacemaker 1.1.11 and corosync 2.3.0 backported for Ubuntu 12.04 LTS - what is the preferred cli now? crmsh, pcs or what? Ha, that's a loaded question. pcs is gaining a lot of traction and it is what I use exclusively now. - is pacemaker-mgmt (pygui) 2.1.2 supposed to work with pacemaker 1.1.11 and corosync 2.3.0? I have it installed and enabled (use_mgmtd: yes), but it isn't loading with no errors. I had it working with stock Ubuntu pacemaker/corosync. - is cman a requirement for DRBD+OCFS2 or it can be safely run with corosync on Ubuntu/Debian? I don't use debian, someone else would have to give their input. Good Luck! I'm interested to hear how your setup goes. -- Vossel ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
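A hedged sketch of how option 3 commonly looks with pcs: the VM is an ordinary VirtualDomain resource, and the remote-node meta attribute turns it into a pacemaker_remote guest node (all names here are illustrative, not from the thread):

pcs resource create vm-guest1 ocf:heartbeat:VirtualDomain \
    hypervisor="qemu:///system" config="/etc/libvirt/qemu/vm-guest1.xml" \
    meta remote-node=vm-guest1
# resources can then be placed on the guest node like on any other node
pcs constraint location some-service prefers vm-guest1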
Re: [Pacemaker] Stopping resource using pcs
- Original Message - From: K Mehta kiranmehta1...@gmail.com To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Friday, February 28, 2014 7:05:47 AM Subject: Re: [Pacemaker] Stopping resource using pcs Can anyone tell me why --wait parameter always causes pcs resource disable to return failure though resource actually stops within time ? does it only show an error with multi-state resources? It is probably a bug. -- Vossel On Wed, Feb 26, 2014 at 10:45 PM, K Mehta kiranmehta1...@gmail.com wrote: Deleting master resource id does not work. I see the same issue. However, uncloning helps. Delete works after disabling and uncloning. I see anissue in using --wait option with disable. Resources moves into stopped state but still error an error message is printed. When --wait option is not provided, error message is not seen [root@sys11 ~]# pcs resource Master/Slave Set: ms-de5566b1-c2a3-4dc6-9712-c82bb43f19d8 [vha-de5566b1-c2a3-4dc6-9712-c82bb43f19d8] Masters: [ sys11 ] Slaves: [ sys12 ] [root@sys11 ~]# pcs resource disable ms-de5566b1-c2a3-4dc6-9712-c82bb43f19d8 --wait Error: unable to stop: 'ms-de5566b1-c2a3-4dc6-9712-c82bb43f19d8', please check logs for failure information [root@sys11 ~]# pcs resource Master/Slave Set: ms-de5566b1-c2a3-4dc6-9712-c82bb43f19d8 [vha-de5566b1-c2a3-4dc6-9712-c82bb43f19d8] Stopped: [ vha-de5566b1-c2a3-4dc6-9712-c82bb43f19d8:0 vha-de5566b1-c2a3-4dc6-9712-c82bb43f19d8:1 ] [root@sys11 ~]# pcs resource disable ms-de5566b1-c2a3-4dc6-9712-c82bb43f19d8 --wait Error: unable to stop: 'ms-de5566b1-c2a3-4dc6-9712-c82bb43f19d8', please check logs for failure information error message [root@sys11 ~]# pcs resource enable ms-de5566b1-c2a3-4dc6-9712-c82bb43f19d8 [root@sys11 ~]# pcs resource Master/Slave Set: ms-de5566b1-c2a3-4dc6-9712-c82bb43f19d8 [vha-de5566b1-c2a3-4dc6-9712-c82bb43f19d8] Masters: [ sys11 ] Slaves: [ sys12 ] [root@sys11 ~]# pcs resource disable ms-de5566b1-c2a3-4dc6-9712-c82bb43f19d8 [root@sys11 ~]# pcs resource Master/Slave Set: ms-de5566b1-c2a3-4dc6-9712-c82bb43f19d8 [vha-de5566b1-c2a3-4dc6-9712-c82bb43f19d8] Stopped: [ vha-de5566b1-c2a3-4dc6-9712-c82bb43f19d8:0 vha-de5566b1-c2a3-4dc6-9712-c82bb43f19d8:1 ] On Wed, Feb 26, 2014 at 8:55 PM, David Vossel dvos...@redhat.com wrote: - Original Message - From: Frank Brendel frank.bren...@eurolog.com To: pacemaker@oss.clusterlabs.org Sent: Wednesday, February 26, 2014 8:53:19 AM Subject: Re: [Pacemaker] Stopping resource using pcs I guess we need some real experts here. I think it's because you're attempting to delete the resource and not the Master. Try deleting the Master instead of the resource. Yes, delete the Master resource id, not the primitive resource within the master. When using pcs, you should always refer to the resource's top most parent id, not the id of the children resources within the parent. If you make a resource a clone, start using the clone id. Same with master. If you add a resource to a group, reference the group id from then on and not any of the children resources within the group. As a general practice, it is always better to stop a resource (pcs resource disable) and only delete the resource after the stop has completed. This is especially important for group resources where stop order matters. If you delete a group, then we have no information on what order to stop the resources in that group. This can cause stop failures when the orphaned resources are cleaned up. 
Recently pcs gained the ability to attempt to stop resources before deleting them in order to avoid scenarios like i described above. Pcs will block for a period of time waiting for the resource to stop before deleting it. Even with this logic in place it is preferred to stop the resource manually then delete the resource once you have verified it stopped. -- Vossel I had a similar problem with a cloned group and solved it by un-cloning before deleting the group. Maybe un-cloning the multi-state resource could help too. It's easy to reproduce. # pcs resource create resPing ping host_list=10.0.0.1 10.0.0.2 op monitor on-fail=restart # pcs resource group add groupPing resPing # pcs resource clone groupPing clone-max=3 clone-node-max=1 # pcs resource Clone Set: groupPing-clone [groupPing] Started: [ node1 node2 node3 ] # pcs resource delete groupPing-clone Deleting Resource (and group) - resPing Error: Unable to remove resource 'resPing' (do constraints exist?) # pcs resource unclone groupPing # pcs resource delete groupPing Removing group: groupPing (and all resources within group) Stopping all resources in group: groupPing... Deleting Resource (and group) - resPing Log: Feb 26 15:43:16 node1 cibadmin[2368]: notice: crm_log_args: Invoked: /usr/sbin/cibadmin -o resources -D --xml-text group id=groupPing#012
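Put together, the workflow Vossel describes looks something like this, using the master id from earlier in the thread:

pcs resource disable ms-de5566b1-c2a3-4dc6-9712-c82bb43f19d8
pcs resource    # repeat until the set reports Stopped
pcs resource delete ms-de5566b1-c2a3-4dc6-9712-c82bb43f19d8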
Re: [Pacemaker] Stopping resource using pcs
- Original Message - From: Frank Brendel frank.bren...@eurolog.com To: pacemaker@oss.clusterlabs.org Sent: Wednesday, February 26, 2014 8:53:19 AM Subject: Re: [Pacemaker] Stopping resource using pcs I guess we need some real experts here. I think it's because you're attempting to delete the resource and not the Master. Try deleting the Master instead of the resource. Yes, delete the Master resource id, not the primitive resource within the master. When using pcs, you should always refer to the resource's top most parent id, not the id of the children resources within the parent. If you make a resource a clone, start using the clone id. Same with master. If you add a resource to a group, reference the group id from then on and not any of the children resources within the group. As a general practice, it is always better to stop a resource (pcs resource disable) and only delete the resource after the stop has completed. This is especially important for group resources where stop order matters. If you delete a group, then we have no information on what order to stop the resources in that group. This can cause stop failures when the orphaned resources are cleaned up. Recently pcs gained the ability to attempt to stop resources before deleting them in order to avoid scenarios like i described above. Pcs will block for a period of time waiting for the resource to stop before deleting it. Even with this logic in place it is preferred to stop the resource manually then delete the resource once you have verified it stopped. -- Vossel I had a similar problem with a cloned group and solved it by un-cloning before deleting the group. Maybe un-cloning the multi-state resource could help too. It's easy to reproduce. # pcs resource create resPing ping host_list=10.0.0.1 10.0.0.2 op monitor on-fail=restart # pcs resource group add groupPing resPing # pcs resource clone groupPing clone-max=3 clone-node-max=1 # pcs resource Clone Set: groupPing-clone [groupPing] Started: [ node1 node2 node3 ] # pcs resource delete groupPing-clone Deleting Resource (and group) - resPing Error: Unable to remove resource 'resPing' (do constraints exist?) # pcs resource unclone groupPing # pcs resource delete groupPing Removing group: groupPing (and all resources within group) Stopping all resources in group: groupPing... 
Deleting Resource (and group) - resPing Log: Feb 26 15:43:16 node1 cibadmin[2368]: notice: crm_log_args: Invoked: /usr/sbin/cibadmin -o resources -D --xml-text group id=groupPing#012 primitive class=ocf id=resPing provider=pacemaker type=ping#012 instance_attributes id=resPing-instance_attributes#012 nvpair id=resPing-instance_attributes-host_list name=host_list value=10.0.0.1 10.0.0.2/#012 /instance_attributes#012 operations#012 op id=resPing-monitor-on-fail-restart interval=60s name=monitor on-fail=restart/#012 /operations#012 /primi Feb 26 15:43:16 node1 cib[1820]: error: xml_log: Expecting an element meta_attributes, got nothing Feb 26 15:43:16 node1 cib[1820]: error: xml_log: Invalid sequence in interleave Feb 26 15:43:16 node1 cib[1820]: error: xml_log: Element clone failed to validate content Feb 26 15:43:16 node1 cib[1820]: error: xml_log: Element resources has extra content: primitive Feb 26 15:43:16 node1 cib[1820]: error: xml_log: Invalid sequence in interleave Feb 26 15:43:16 node1 cib[1820]: error: xml_log: Element cib failed to validate content Feb 26 15:43:16 node1 cib[1820]: warning: cib_perform_op: Updated CIB does not validate against pacemaker-1.2 schema/dtd Feb 26 15:43:16 node1 cib[1820]: warning: cib_diff_notify: Update (client: cibadmin, call:2): 0.516.7 - 0.517.1 (Update does not conform to the configured schema) Feb 26 15:43:16 node1 stonith-ng[1821]: warning: update_cib_cache_cb: [cib_diff_notify] ABORTED: Update does not conform to the configured schema (-203) Feb 26 15:43:16 node1 cib[1820]: warning: cib_process_request: Completed cib_delete operation for section resources: Update does not conform to the configured schema (rc=-203, origin=local/cibadmin/2, version=0.516.7) Frank Am 26.02.2014 15:00, schrieb K Mehta: Here is the config and output of few commands [root@sys11 ~]# pcs config Cluster Name: kpacemaker1.1 Corosync Nodes: Pacemaker Nodes: sys11 sys12 Resources: Master: ms-de5566b1-c2a3-4dc6-9712-c82bb43f19d8 Meta Attrs: clone-max=2 globally-unique=false target-role=Started Resource: vha-de5566b1-c2a3-4dc6-9712-c82bb43f19d8 (class=ocf provider=heartbeat type=vgc-cm-agent.ocf) Attributes: cluster_uuid=de5566b1-c2a3-4dc6-9712-c82bb43f19d8 Operations: monitor interval=30s role=Master timeout=100s (vha-de5566b1-c2a3-4dc6-9712-c82bb43f19d8-monitor-interval-30s) monitor interval=31s role=Slave timeout=100s (vha-de5566b1-c2a3-4dc6-9712-c82bb43f19d8-monitor-interval-31s) Stonith Devices: Fencing Levels: Location Constraints: Resource:
Re: [Pacemaker] getting started with development
- Original Message - From: Tasim Noor tasimn...@gmail.com To: pacemaker@oss.clusterlabs.org Sent: Tuesday, February 25, 2014 5:10:33 PM Subject: [Pacemaker] getting started with development Hi All, I would be interested in contributing to the pacemaker/linux HA codebase. I did look through the TODO but it doesn't say which of topics are currently worked on and which ones are open to be taken up. i would appreciate if somebody can point me to a starting point i.e some feature that i can start looking at to get my hands dirty along with some pointers to specific source files as a starting point. Awesome, one of the things we recommend to pacemaker developers is to learn about our unit tests. Specifically, being able to run CTS in a virtualized environment with 3 or more nodes is a good exercise. I run CTS on KVM instances using libvirt. For fencing I use fence_virtd on the host machine and the fence_xvm agent within the guest vms. After you get fence_virtd running and accessible from the guests ('fence_xvm -o list' should list all the running guest vms when executed from a guest vm) you can use the steps I have outlined in the scenario file below to execute cts. https://github.com/davidvossel/phd/blob/master/scenarios/cts-virt.scenario CTS is not strictly required for all pacemaker development. Depending on how deep you want go it is very helpful at verifying invasive changes. -- Vossel Thanks for your help. Kind Regards, Tasim ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Possible error in RA invocation
- Original Message - From: Santiago Pérez santiago.pe...@entertainment-solutions.eu To: pacemaker@oss.clusterlabs.org Sent: Thursday, January 30, 2014 1:50:41 PM Subject: [Pacemaker] Possible error in RA invocation Hi everyone, I am running a two-node cluster which hosts two Xen VMs. We're using DRBD, but it's managed directly from Xen. The configuration of one of this resources is as follows: primitive xen-vm1 ocf:heartbeat:Xen params xmfile=/etc/xen/vm1.cfg op monitor interval=30s op start interval=0 timeout=60s op stop interval=0 timeout=300s op migrate_from interval=0 timeout=240 ingerval=0 op migrate_to interval=0 timeout=240 meta allow-migrate=true target-role=Started meta target-role=Started I have a problem with the monitor operation. It seems to be working fine... until it doesn't. The cluster can be running for weeks without any failure, but sometimes the monitor operation fails with a really strange error from the resource agent. This is an excerpt of one of the failures: Jan 28 14:40:20 xenhost1 lrmd: [3822]: info: rsc:xen-vm1 monitor[71] (pid 11756) Jan 28 14:40:20 xenhost1 lrmd: [3822]: info: operation monitor[71] on xen-vm1 for client 3825: pid 11756 exited with return code 0 Jan 28 15:40:26 xenhost1 lrmd: [3822]: info: rsc:xen-vm1 monitor[71] (pid 18065) Jan 28 15:40:27 xenhost1 lrmd: [3822]: info: operation monitor[71] on xen-vm1 for client 3825: pid 18065 exited with return code 0 Jan 28 16:40:32 xenhost1 lrmd: [3822]: info: rsc:xen-vm1 monitor[71] (pid 24373) Jan 28 16:40:32 xenhost1 lrmd: [3822]: info: operation monitor[71] on xen-vm1 for client 3825: pid 24373 exited with return code 0 Jan 28 17:40:38 xenhost1 lrmd: [3822]: info: rsc:xen-vm1 monitor[71] (pid 30686) Jan 28 17:40:38 xenhost1 lrmd: [3822]: info: operation monitor[71] on xen-vm1 for client 3825: pid 30686 exited with return code 0 Jan 28 18:40:44 xenhost1 lrmd: [3822]: info: rsc:xen-vm1 monitor[71] (pid 4593) Jan 28 18:40:44 xenhost1 lrmd: [3822]: info: operation monitor[71] on xen-vm1 for client 3825: pid 4593 exited with return code 0 Jan 28 18:55:23 xenhost1 lrmd: [3822]: info: RA output: (xen-vm1:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/Xen: 71: local: Jan 28 18:55:23 xenhost1 lrmd: [3822]: info: RA output: (xen-vm1:monitor:stderr) en-list: bad variable name This is weird. It is almost like your shell environment is borked. I'm not sure what is causing this. -- Vossel Jan 28 18:55:23 xenhost1 lrmd: [3822]: info: RA output: (xen-vm1:monitor:stderr) Jan 28 18:55:23 xenhost1 lrmd: [3822]: info: cancel_op: operation monitor[71] on xen-vm1 for client 3825, its parameters: crm_feature_set=[3.0.6] xmfile=[/etc/xen/vm1.cfg] CRM_meta_name=[monitor] CRM_meta_interval=[3] CRM_meta_timeout=[2] cancelled Jan 28 18:55:23 xenhost1 lrmd: [3822]: info: rsc:xen-vm1 stop[72] (pid 6219) The machines are very low on resources, and this unnecessary migration is causing problems. The systems are running Debian Wheezy with pacemaker 1.1.7-1 and resource-agents 3.9.2-5+deb7u1. I don't know yet if there's a problem with the Xen RA, the lrmd service itself or my configuration. I wasn't able to find any information related to this issue. Do you have any idea of what could be causing this? Any help will be appreciated. 
Regards, Santiago ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
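One hedged way to dig further is to run the agent's monitor action by hand, with the OCF environment the lrmd would set and shell tracing enabled, and watch which line produces the "bad variable name" message (the interpreter matters, so use whatever the script's shebang declares):

OCF_ROOT=/usr/lib/ocf OCF_RESKEY_xmfile=/etc/xen/vm1.cfg \
    sh -x /usr/lib/ocf/resource.d/heartbeat/Xen monitor
echo "monitor rc=$?"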
Re: [Pacemaker] [Gluster-users] Pacemaker and GlusterFS
- Original Message - From: Jefferson Carlos Machado lista.li...@results.com.br To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Tuesday, February 11, 2014 7:03:50 AM Subject: Re: [Pacemaker] [Gluster-users] Pacemaker and GlusterFS Hi Vossel, I allready do this. Resource: home (class=ocf provider=heartbeat type=Filesystem) Attributes: device=localhost:/home_gv directory=/home fstype=glusterfs Operations: start interval=0 timeout=60 (home-start-interval-0) stop interval=0 timeout=240 (home-stop-interval-0) monitor interval=30s role=Started (home-monitor-interval-0) But when I try start I get error bellow and can see error in the log. Operation start for home (ocf:heartbeat:Filesystem) returned 1 stdout: Mount failed. Please check the log file for more details. stderr: INFO: Running start for localhost:/home_gv on /home stderr: ERROR: Couldn't mount filesystem localhost:/home_gv on /home Work fine with fstab and mount -a Ah, It looks like the FileSystem agent needs to be updated to work correctly with gluster then. -- Vossel [root@srvmail0 ~]# mount -a [root@srvmail0 ~]# df Filesystem 1K-blocks Used Available Use% Mounted on /dev/mapper/VolGroup-lv_root 1491664 1298072117816 92% / tmpfs 31223644080268156 15% /dev/shm /dev/xvda1 49584475560394684 17% /boot /dev/xvdb120954552 16049876 4904676 77% /gv /dev/xvdc1 2063504 1133824824860 58% /var localhost:/gv_home20954496 16049920 4904576 77% /home [root@srvmail0 ~]# cat /etc/fstab # # /etc/fstab # Created by anaconda on Wed Dec 19 18:01:54 2012 # # Accessible filesystems, by reference, are maintained under '/dev/disk' # See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info # /dev/mapper/VolGroup-lv_root / ext4 defaults1 1 UUID=a7af8398-cbea-495f-80cd-1a642d94d9f4 /boot ext4defaults1 2 /dev/mapper/VolGroup-lv_swap swapswap defaults0 0 tmpfs /dev/shmtmpfs defaults0 0 devpts /dev/ptsdevpts gid=5,mode=620 0 0 sysfs /syssysfs defaults0 0 proc/proc proc defaults0 0 /dev/xvdb1 /gv xfsdefaults1 1 /dev/xvdc1 /var ext4defaults1 1 localhost:/gv_home /home glusterfs _netdev 0 0 [root@srvmail0 ~]# Regards, Em 07-02-2014 17:53, David Vossel escreveu: - Original Message - From: Jefferson Carlos Machado lista.li...@results.com.br To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org, gluster-us...@gluster.org Sent: Friday, February 7, 2014 11:55:37 AM Subject: [Pacemaker] [Gluster-users] Pacemaker and GlusterFS Hi, How the best way to create a resource filesystem managed type glusterfs? I suppose using the Filesystem resource agent. 
https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/Filesystem -- Vossel Regards, ___ Gluster-users mailing list gluster-us...@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
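For comparison, a hedged pcs sketch of a Filesystem resource that mirrors the working fstab line above (note that fstab mounts localhost:/gv_home, while the resource in this thread points at localhost:/home_gv):

pcs resource create home ocf:heartbeat:Filesystem \
    device="localhost:/gv_home" directory="/home" fstype="glusterfs" \
    op start timeout=60 stop timeout=240 monitor interval=30s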
Re: [Pacemaker] [Problem] Fail-over is delayed.(State transition is not calculated.)
- Original Message - From: renayama19661...@ybb.ne.jp To: PaceMaker-ML pacemaker@oss.clusterlabs.org Sent: Monday, February 17, 2014 7:06:53 PM Subject: [Pacemaker] [Problem] Fail-over is delayed.(State transition is not calculated.) Hi All, I confirmed movement at the time of the trouble in one of Master/Slave in Pacemaker1.1.11. - Step1) Constitute a cluster. [root@srv01 ~]# crm_mon -1 -Af Last updated: Tue Feb 18 18:07:24 2014 Last change: Tue Feb 18 18:05:46 2014 via crmd on srv01 Stack: corosync Current DC: srv01 (3232238180) - partition with quorum Version: 1.1.10-9d39a6b 2 Nodes configured 6 Resources configured Online: [ srv01 srv02 ] vip-master (ocf::heartbeat:Dummy): Started srv01 vip-rep(ocf::heartbeat:Dummy): Started srv01 Master/Slave Set: msPostgresql [pgsql] Masters: [ srv01 ] Slaves: [ srv02 ] Clone Set: clnPingd [prmPingd] Started: [ srv01 srv02 ] Node Attributes: * Node srv01: + default_ping_set : 100 + master-pgsql : 10 * Node srv02: + default_ping_set : 100 + master-pgsql : 5 Migration summary: * Node srv01: * Node srv02: Step2) Monitor error in vip-master. [root@srv01 ~]# rm -rf /var/run/resource-agents/Dummy-vip-master.state [root@srv01 ~]# crm_mon -1 -Af Last updated: Tue Feb 18 18:07:58 2014 Last change: Tue Feb 18 18:05:46 2014 via crmd on srv01 Stack: corosync Current DC: srv01 (3232238180) - partition with quorum Version: 1.1.10-9d39a6b 2 Nodes configured 6 Resources configured Online: [ srv01 srv02 ] Master/Slave Set: msPostgresql [pgsql] Masters: [ srv01 ] Slaves: [ srv02 ] Clone Set: clnPingd [prmPingd] Started: [ srv01 srv02 ] Node Attributes: * Node srv01: + default_ping_set : 100 + master-pgsql : 10 * Node srv02: + default_ping_set : 100 + master-pgsql : 5 Migration summary: * Node srv01: vip-master: migration-threshold=1 fail-count=1 last-failure='Tue Feb 18 18:07:50 2014' * Node srv02: Failed actions: vip-master_monitor_1 on srv01 'not running' (7): call=30, status=complete, last-rc-change='Tue Feb 18 18:07:50 2014', queued=0ms, exec=0ms - However, the resource does not fail-over. But, fail-over is calculated when I check cib in crm_simulate at this point in time. 
- [root@srv01 ~]# crm_simulate -L -s Current cluster status: Online: [ srv01 srv02 ] vip-master (ocf::heartbeat:Dummy): Stopped vip-rep(ocf::heartbeat:Dummy): Stopped Master/Slave Set: msPostgresql [pgsql] Masters: [ srv01 ] Slaves: [ srv02 ] Clone Set: clnPingd [prmPingd] Started: [ srv01 srv02 ] Allocation scores: clone_color: clnPingd allocation score on srv01: 0 clone_color: clnPingd allocation score on srv02: 0 clone_color: prmPingd:0 allocation score on srv01: INFINITY clone_color: prmPingd:0 allocation score on srv02: 0 clone_color: prmPingd:1 allocation score on srv01: 0 clone_color: prmPingd:1 allocation score on srv02: INFINITY native_color: prmPingd:0 allocation score on srv01: INFINITY native_color: prmPingd:0 allocation score on srv02: 0 native_color: prmPingd:1 allocation score on srv01: -INFINITY native_color: prmPingd:1 allocation score on srv02: INFINITY clone_color: msPostgresql allocation score on srv01: 0 clone_color: msPostgresql allocation score on srv02: 0 clone_color: pgsql:0 allocation score on srv01: INFINITY clone_color: pgsql:0 allocation score on srv02: 0 clone_color: pgsql:1 allocation score on srv01: 0 clone_color: pgsql:1 allocation score on srv02: INFINITY native_color: pgsql:0 allocation score on srv01: INFINITY native_color: pgsql:0 allocation score on srv02: 0 native_color: pgsql:1 allocation score on srv01: -INFINITY native_color: pgsql:1 allocation score on srv02: INFINITY pgsql:1 promotion score on srv02: 5 pgsql:0 promotion score on srv01: 1 native_color: vip-master allocation score on srv01: -INFINITY native_color: vip-master allocation score on srv02: INFINITY native_color: vip-rep allocation score on srv01: -INFINITY native_color: vip-rep allocation score on srv02: INFINITY Transition Summary: * Start vip-master (srv02) * Start vip-rep (srv02) * Demote pgsql:0 (Master - Slave srv01) * Promote pgsql:1 (Slave - Master srv02) - In addition, fail-over is calculated even if cluster_recheck_interval is carried out. Fail-over is carried out even if I carry out cibadmin -B. - [root@srv01 ~]# cibadmin -B [root@srv01 ~]# crm_mon -1 -Af Last updated: Tue Feb 18 18:21:15 2014 Last change: Tue Feb 18
Re: [Pacemaker] node1 fencing itself after node2 being fenced
- Original Message - From: Vladislav Bogdanov bub...@hoster-ok.com To: pacemaker@oss.clusterlabs.org Sent: Tuesday, February 18, 2014 1:02:09 PM Subject: Re: [Pacemaker] node1 fencing itself after node2 being fenced 18.02.2014 19:49, Asgaroth wrote: i sometimes have the same situation. sleep ~30 seconds between startup cman and clvmd helps a lot. Thanks for the tip, I just tried this (added sleep 30 in the start section of case statement in cman script, but this did not resolve the issue for me), for some reason clvmd just refuses to start, I don’t see much debugging errors shooting up, so I cannot say for sure what clvmd is trying to do :( I actually just made a patch related to this. If you are managing the dlm with pacemaker, you'll want to use this patch. It disables startup fencing in the dlm and has pacemaker perform the fencing instead. The agent checks the startup fencing condition, so you'll need that bit as well instead of just disabling startup fencing in the dlm. -- Vossel Just a guess. Do you have startup fencing enabled in dlm-controld (I actually do not remember if it is applicable to cman's version, but it exists in dlm-4) or cman? If yes, then that may play its evil game, because imho it is not intended to use with pacemaker which has its own startup fencing policy (if you redirect fencing to pacemaker). ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
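As a hedged alternative to sleeping between init scripts, the dlm and clvmd can be handed to pacemaker with an explicit ordering, roughly along these lines (this assumes the controld and clvm agents are available in your resource-agents build; the -clone names are the pcs defaults):

pcs resource create dlm ocf:pacemaker:controld op monitor interval=30s clone interleave=true ordered=true
pcs resource create clvmd ocf:heartbeat:clvm op monitor interval=30s clone interleave=true ordered=true
pcs constraint order start dlm-clone then clvmd-clone
pcs constraint colocation add clvmd-clone with dlm-clone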
Re: [Pacemaker] nfs4 cluster fail-over stops working once I introduce ipaddr2 resource
- Original Message - From: Dennis Jacobfeuerborn denni...@conversis.de To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Thursday, February 13, 2014 11:18:04 PM Subject: Re: [Pacemaker] nfs4 cluster fail-over stops working once I introduce ipaddr2 resource On 14.02.2014 02:50, Dennis Jacobfeuerborn wrote: Hi, I'm still working on my NFSv4 cluster and things are working as expected...as long as I don't add an IPAddr2 resource. The DRBD, filesystem and exportfs resources work fine, and when I put the active node into standby everything fails over as expected. Once I add a VIP as an IPAddr2 resource, however, I seem to get monitor problems with the p_exportfs_root resource. I've attached the configuration, status and a log file. The transition status is the status a moment after I take nfs1 (192.168.100.41) offline. It looks like the stopping of p_ip_nfs does something to the p_exportfs_root resource, although I have no idea what that could be. The final status is the status after the cluster has settled. The fail-over finished, but the failed action is still present and cannot be cleared with a crm resource cleanup p_exportfs_root. The log is the result of a tail -f on the corosync.log from the moment before I issued the crm node standby nfs1 to when the cluster had settled. Does anybody know what the issue could be here? At first I thought that using a VIP from the same network as the cluster nodes could be an issue, but when I change this to use an IP in a different network (192.168.101.43/24) the same thing happens. The moment I remove p_ip_nfs from the configuration again, fail-over back and forth works without a hitch. So after a lot of digging I think I pinpointed the issue: a race between the monitoring and stop actions of the exportfs resource script. When wait_for_leasetime_on_stop is set, the following happens for the stop action, in this specific order: 1. The directory is unexported. 2. Sleep nfs lease time + 2 seconds. The problem seems to be that during the sleep phase the monitoring action is still invoked, and since the directory has already been unexported it reports a failure. Once I add enabled=false to the monitoring action of the exportfs resource the problem disappears. The question is how to ensure that the monitoring action is not called while the stop action is still sleeping. Would it be a solution to create a lock file for the duration of the sleep and check for that lock file in the monitoring action? I'm not 100% sure if this analysis is correct, because if monitoring calls are still made while the stop action is running, this sounds inherently racy and would probably be an issue for almost all resource scripts, not just exportfs. Right, I doubt that is happening. What happens if you put the ip before the nfs server? group g_nfs p_ip_nfs p_fs_data p_exportfs_root p_exportfs_data Without drbd, I have a scenario I test for an active/passive nfs server here that works for me: https://github.com/davidvossel/phd/blob/master/scenarios/nfs-basic.scenario I'm using the actual nfsserver ocf script from the latest resource-agents github branch. -- Vossel
Regards, Dennis ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
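To make Dennis's lock-file idea concrete, a minimal sketch could look like the fragment below. This is greatly simplified compared to the real ocf:heartbeat:exportfs agent, and the export path, lock location and lease time are made up for illustration: the stop action drops a lock file while it sleeps out the lease time, and the monitor action reports "running" while that lock exists, so the in-flight stop is not counted as a failure.

  : ${NFS_LEASE_TIME:=90}
  LOCK="/var/run/exportfs-root.stopping"

  exportfs_stop() {
      touch "$LOCK"
      exportfs -u "*:/srv/nfs/root"        # 1. unexport the directory
      sleep $((NFS_LEASE_TIME + 2))        # 2. wait out the nfs lease time
      rm -f "$LOCK"
      return 0                             # OCF_SUCCESS
  }

  exportfs_monitor() {
      [ -f "$LOCK" ] && return 0           # stop in progress: do not report a failure
      exportfs | grep -q "/srv/nfs/root" && return 0
      return 7                             # OCF_NOT_RUNNING
  }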
[Pacemaker] Announcing Pacemaker v1.1.11
I am excited to announce the release of Pacemaker v1.1.11 https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.11 There were no changes between the final release and rc5. This has been a very successful release process. I'm proud of the testing and contributions the community put into this release. Thank you all for your support, this community is great :D Looking forward to Pacemaker 1.1.12, we have a lot of new functionality on the horizon. Scaling pacemaker from a dozen or so nodes to hundreds possibly thousands of nodes is a very real and attainable goal for us this year. An announcement about 1.1.12 features and beta testing should arrive in the next few months. If you are a user of `pacemaker_remoted`, you should take the time to read about changes to the online wire protocol that are present in this release. http://blog.clusterlabs.org/blog/2013/changes-to-the-remote-wire-protocol/ To build `rpm` packages: 1. Clone the current sources: # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git # cd pacemaker 2. Install dependancies (if you haven't already) [Fedora] # sudo yum install -y yum-utils [ALL]# make rpm-dep 3. Build Pacemaker # make release 4. Copy and deploy as needed ## Details - 1.1.11 - final Changesets: 462 Diff: 147 files changed, 6810 insertions(+), 4057 deletions(-) ## Highlights ### Features added since Pacemaker-1.1.10 + attrd: A truly atomic version of attrd for use where CPG is used for cluster communication + cib: Allow values to be added/updated and removed in a single update + cib: Support XML comments in diffs + Core: Allow blackbox logging to be disabled with SIGUSR2 + crmd: Do not block on proxied calls from pacemaker_remoted + crmd: Enable cluster-wide throttling when the cib heavily exceeds its target load + crmd: Make the per-node action limit directly configurable in the CIB + crmd: Slow down recovery on nodes with IO load + crmd: Track CPU usage on cluster nodes and slow down recovery on nodes with high CPU/IO load + crm_mon: add --hide-headers option to hide all headers + crm_node: Display partition output in sorted order + crm_report: Collect logs directly from journald if available + Fencing: On timeout, clean up the agent's entire process group + Fencing: Support agents that need the host to be unfenced at startup + ipc: Raise the default buffer size to 128k + PE: Add a special attribute for distinguishing between real nodes and containers in constraint rules + PE: Allow location constraints to take a regex pattern to match against resource IDs + pengine: Distinguish between the agent being missing and something the agent needs being missing + remote: Properly version the remote connection protocol ### Changes since Pacemaker-1.1.10 + Bug rhbz#1011618 - Consistently use 'Slave' as the role for unpromoted master/slave resources + Bug rhbz#1057697 - Use native DBus library for systemd and upstart support to avoid problematic use of threads + attrd: Any variable called 'cluster' makes the daemon crash before reaching main() + attrd: Avoid infinite write loop for unknown peers + attrd: Drop all attributes for peers that left the cluster + attrd: Give remote-nodes ability to set attributes with attrd + attrd: Prevent inflation of attribute dampen intervals + attrd: Support SI units for attribute dampening + Bug cl#5171 - pengine: Don't prevent clones from running due to dependant resources + Bug cl#5179 - Corosync: Attempt to retrieve a peer's node name if it is not already known + Bug cl#5181 - corosync: Ensure node IDs are 
written to the CIB as unsigned integers + Bug rhbz#902407 - crm_resource: Handle --ban for master/slave resources as advertised + cib: Correctly check for archived configuration files + cib: Correctly log short-form xml diffs + cib: Fix remote cib based on TLS + cibadmin: Report errors during sign-off + cli: Do not enabled blackbox for cli tools + cluster: Fix segfault on removing a node + cman: Do not start pacemaker if cman startup fails + cman: Start clvmd and friends from the init script if enabled + Command-line tools should stop after an assertion failure + controld: Use the correct variant of dlm_controld for corosync-2 clusters + cpg: Correctly set the group name length + cpg: Ensure the CPG group is always null-terminated + cpg: Only process one message at a time to allow other priority jobs to be performed + crmd: Correctly observe the configured batch-limit + crmd: Correctly update expected state when the previous DC shuts down + crmd: Correcty update the history cache when recurring ops change their return code + crmd: Don't add node_state to cib, if we have not seen or fenced this node yet + crmd: don't segfault on shutdown when using heartbeat + crmd: Prevent recurring monitors being cancelled due to notify operations +
Re: [Pacemaker] ocf:lvm2:clvmd resource agent
- Original Message - From: Andrew Daugherity adaugher...@tamu.edu To: pacemaker@oss.clusterlabs.org pacemaker@oss.clusterlabs.org Sent: Wednesday, February 12, 2014 4:56:18 PM Subject: [Pacemaker] ocf:lvm2:clvmd resource agent I noticed in recent discussions on this list that this RA is apparently a SUSE thing and not upstreamed into resource-agents. This was news to me, but apparently is indeed the case. I've just introduced (as of today) a clvmd agent for review upstream. It is not the SUSE agent. I would like to encourage SUSE to merge their features into this agent and support the upstream effort here. https://github.com/ClusterLabs/resource-agents/pull/382 I guess it's SUSE's decision whether to push it upstream but IMO that would be the best way to go, so it could become the standard by-the-book way to use clvmd with pacemaker. Right now it lives in the lvm2-clvm RPM, which is in the SLES 11 HAE add-on and also in the standard OSS repo for openSUSE [1]. The rest of this message is directed more at the SUSE developers/engineers who read this list; hopefully this is a more eyeballs = bugs are shallow thing than an annoyance... For now, is there a github repo or equivalent for this package, or do you just want people to file bugs with openSUSE/support requests with Novell? Reason I ask is, I noticed lots of log spamming by clvmd after upgrading from SLES 11 SP2 to SP3. Indeed clvmd is now running with the '-d2' option, which is the new default: # Parameter defaults : ${OCF_RESKEY_CRM_meta_globally_unique:=false} : ${OCF_RESKEY_daemon_timeout:=80} : ${OCF_RESKEY_daemon_options:=-d2} In SP2 it read ': ${OCF_RESKEY_daemon_options:=-d0}'. After adjusting my clvmd cluster resource to silence this, by adding daemon_options like so: primitive clvm ocf:lvm2:clvmd \ params daemon_options=-d0 \ op start interval=0 timeout=90 \ op stop interval=0 timeout=100 syslog is back to normal. In the RPM changelog it looks like this was intentional, but the bug in question is marked private, so I have no idea why this was done: * Tue Jan 15 2013 dmzh...@suse.com - clvmd update to 2.20.98,fix colletive bugs. - fate#314367, cLVM should support option mirrored in a clustered environment - Fix debugging level set in clvmd_set_debug by using the correct variable (bnc#785467),change default -d0 to -d2 Can someone who has access explain why full -d2 debug mode is now the default? This doesn't seem like a sensible default. Thanks, Andrew Daugherity Systems Analyst Division of Research, Texas A&M University [1] https://build.opensuse.org/package/show?project=openSUSE%3AFactory&package=lvm2 Specifically, see clvmd.ocf. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] [Gluster-users] Pacemaker and GlusterFS
- Original Message - From: Jefferson Carlos Machado lista.li...@results.com.br To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org, gluster-us...@gluster.org Sent: Friday, February 7, 2014 11:55:37 AM Subject: [Pacemaker] [Gluster-users] Pacemaker and GlusterFS Hi, what is the best way to create a filesystem resource for a glusterfs mount? I suppose using the Filesystem resource agent. https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/Filesystem -- Vossel Regards, ___ Gluster-users mailing list gluster-us...@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
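A minimal sketch of what that Filesystem-based approach might look like with pcs (the volume name, mount point and timings are made up for illustration):

  pcs resource create fs_gluster ocf:heartbeat:Filesystem \
      device="localhost:/myvol" directory="/mnt/gluster" fstype="glusterfs" \
      op monitor interval=20s timeout=40s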
Re: [Pacemaker] Time to get ready for 1.1.11
- Original Message - From: Andrew Beekhof and...@beekhof.net To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Tuesday, January 28, 2014 11:32:32 PM Subject: Re: [Pacemaker] Time to get ready for 1.1.11 On 25 Jan 2014, at 2:36 am, David Vossel dvos...@redhat.com wrote: - Original Message - From: David Vossel dvos...@redhat.com To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Thursday, January 23, 2014 10:08:35 PM Subject: Re: [Pacemaker] Time to get ready for 1.1.11 I ran into a nasty bug where the crmd can infinitely block while retrieving metadata for systemd resources. This could affect other resource types as well, but I've only encountered it with systemd. There will be an RC5 so we can get this patch in. https://github.com/ClusterLabs/pacemaker/commit/b0ab1ccdb55dbead40fae097e4f84e445878afb1 David worked out the cause it the fact that glib uses threads for its GDBus code. The fix (nearly complete) is to use the dbus library directly. Andrew sorted out the whole GDBus threads issue, so it's finally time for a new (hopefully final) release candidate. If no issues are encountered, RC5 will become the final Pacemaker 1.1.11 release. Pacemaker-1.1.11-rc5 https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.11-rc5 CHANGES Between RC4 and RC5 + Low: services: fix building dbus support in buildbot. + Low: services: Fix building dbus + Low: services: Keep dbus build support optional + Refactor: dbus: Use native function for adding arguments to messages + Fix: Bug rhbz#1057697 - Use native DBus library for upstart support to avoid problematic use of threads + Fix: Portability: Use basic types for DBus compatability struct + Build: Add dbus as an rpm dependancy + Refactor: systemd: Simplify dbus API usage + Fix: Bug rhbz#1057697 - Use native DBus library for systemd async support to avoid problematic use of threads + Fix: Bug rhbz#1057697 - Use native DBus library for systemd support to avoid problematic use of threads + Low: pengine: Regression test update for record-pending=true on migrate_to + Fix: xml: Fix segfault in find_entity() + Fix: cib: Fix remote cib based on TLS + High: services: Do not block synced service executions + Fix: cluster: Fix segfault on removing a node + Fix: services: Reset the scheduling policy and priority for lrmd's children without replying on SCHED_RESET_ON_FORK + Fix: services: Correctly reset the nice value for lrmd's children + High: pengine: Force record pending for migrate_to actions + High: pengine: cl#5186 - Avoid running rsc on two nodes when node is fenced during migration -- Vossel ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] error with pcs resource group command
- Original Message - From: Parveen Jain parveenj...@live.com To: pacemaker@oss.clusterlabs.org Sent: Thursday, January 23, 2014 9:24:39 AM Subject: [Pacemaker] error with pcs resource group command Hi Team, I was trying to add a group while converting from my CRM commands to pcs commands: following is the previous crm command: group vip-group vip-prim \ meta target-role=Started the command which I am trying to use is: pcs resource group add vip-group vip-prim meta target-role=Started but whenever I use this command, I get following output: Unable to find resource: meta Unable to find resource: target-role=Started pcs does not have a one to one mapping to crmsh commands. The 'pcs resource group add' command does not accept metadata. use pcs resource meta group id target-role=Started or 'pcs resource enable group id' will do the same thing. The pcs tool tells you what arguments the different commands take. You can view this for yourself. Use 'pcs resource help' to see resource options. You can look at the man page as well 'man pcs' and it has a detailed list. -- Vossel I even consulted the documentation, but it also gives the syntax I am using: https://access.redhat.com/site/documentation//en-US/Red_Hat_Enterprise_Linux/7-Beta/html/High_Availability_Add-On_Reference/s1-resourceopts-HAAR.html#tb-resource-options-HAAR Can anyone guide where I am doing wrong ? Thanks, Parveen ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
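Putting Vossel's pointers together for this particular example, the sequence would be something like the following (group and resource names taken from the question; exact output may differ slightly between pcs versions):

  pcs resource group add vip-group vip-prim
  pcs resource meta vip-group target-role=Started
  # or, to the same effect:
  pcs resource enable vip-group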
Re: [Pacemaker] Having a really hard time with clvmd on RHEL 7 beta
- Original Message - From: Digimer li...@alteeve.ca To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Monday, January 27, 2014 12:15:23 PM Subject: [Pacemaker] Having a really hard time with clvmd on RHEL 7 beta Hi all, I'm having one heck of a time trying to get clvmd working with pacemaker 1.1.10 on RHEL 7 beta... I can configure DRBD dual-primary just fine. I can also configure DLM to start on both nodes just fine as well. However, once I try to add clvmd using lsb::clvmd, the cluster fails randomly. Here is the good config: snip/ Looking at your config, unless I'm missing something I don't see ordering constraints between dlm and clvmd. You need the start dlm-clone then start clvmd-clone order constraint as well as a colocate clvmd-clone with dlm-clone colocation constraint. Otherwise, you are going to run into random start and stop errors. Even after you do this, you may still have problems. The lsb:clvmd init script performs some unnecessary blocking operations during the 'status' operation. I have a patch attached to this issue, https://bugzilla.redhat.com/show_bug.cgi?id=1040670 , that should resolve that issue if you hit it. Hope that helps :) -- Vossel ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
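For reference, the two constraints described above could be added with pcs along these lines (assuming the clone IDs really are dlm-clone and clvmd-clone, as in the message):

  pcs constraint order start dlm-clone then clvmd-clone
  pcs constraint colocation add clvmd-clone with dlm-clone INFINITY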
Re: [Pacemaker] Having a really hard time with clvmd on RHEL 7 beta
- Original Message - From: Lars Marowsky-Bree l...@suse.com To: pacemaker@oss.clusterlabs.org Sent: Monday, January 27, 2014 2:17:32 PM Subject: Re: [Pacemaker] Having a really hard time with clvmd on RHEL 7 beta On 2014-01-27T13:15:23, Digimer li...@alteeve.ca wrote: I try to configure clvmd this way: pcs cluster cib clvmd_cfg pcs -f clvmd_cfg resource create clvmd lsb:clvmd params daemon_timeout=30s op monitor interval=60s Hmmm. Something is not matching up here. lsb resources can't take parameters, can they? You are right. Parameters and LSB agents don't mix. The parameters will get stored in the cib i suppose, but the lrmd doesn't do anything with them for lsb agents. -- Vossel SUSE actually ships a separate ocf:lvm2:clvmd RA (within the lvm2-clvm package), which, I'm displeased to notice, wasn't contributed back upstream (or at least not merged). Wouldn't that make more sense then the LSB script - especially if parameters need to be specified? Regards, Lars -- Architect Storage/HA SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg) Experience is the name everyone gives to their mistakes. -- Oscar Wilde ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
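In other words, if you stay with the LSB script the params block is simply ignored; a cleaner version of the earlier command sequence (a sketch, keeping the same shadow-CIB workflow) would drop the parameters entirely:

  pcs cluster cib clvmd_cfg
  pcs -f clvmd_cfg resource create clvmd lsb:clvmd op monitor interval=60s
  pcs cluster cib-push clvmd_cfg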
Re: [Pacemaker] Time to get ready for 1.1.11
- Original Message - From: David Vossel dvos...@redhat.com To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Thursday, January 23, 2014 10:08:35 PM Subject: Re: [Pacemaker] Time to get ready for 1.1.11 - Original Message - From: David Vossel dvos...@redhat.com To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Wednesday, January 15, 2014 5:16:40 PM Subject: Re: [Pacemaker] Time to get ready for 1.1.11 - Original Message - From: David Vossel dvos...@redhat.com To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Tuesday, January 7, 2014 4:50:11 PM Subject: Re: [Pacemaker] Time to get ready for 1.1.11 - Original Message - From: Andrew Beekhof and...@beekhof.net To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Thursday, December 19, 2013 2:25:00 PM Subject: Re: [Pacemaker] Time to get ready for 1.1.11 On 20 Dec 2013, at 2:11 am, Andrew Martin amar...@xes-inc.com wrote: David/Andrew, Once 1.1.11 final is released, is it considered the new stable series of Pacemaker, yes or should 1.1.10 still be used in very stable/critical production environments? Thanks, Andrew - Original Message - From: David Vossel dvos...@redhat.com To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Wednesday, December 11, 2013 3:33:46 PM Subject: Re: [Pacemaker] Time to get ready for 1.1.11 - Original Message - From: Andrew Beekhof and...@beekhof.net To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Wednesday, November 20, 2013 9:02:40 PM Subject: [Pacemaker] Time to get ready for 1.1.11 With over 400 updates since the release of 1.1.10, its time to start thinking about a new release. Today I have tagged release candidate 1[1]. The most notable fixes include: + attrd: Implementation of a truely atomic attrd for use with corosync 2.x + cib: Allow values to be added/updated and removed in a single update + cib: Support XML comments in diffs + Core: Allow blackbox logging to be disabled with SIGUSR2 + crmd: Do not block on proxied calls from pacemaker_remoted + crmd: Enable cluster-wide throttling when the cib heavily exceeds its target load + crmd: Use the load on our peers to know how many jobs to send them + crm_mon: add --hide-headers option to hide all headers + crm_report: Collect logs directly from journald if available + Fencing: On timeout, clean up the agent's entire process group + Fencing: Support agents that need the host to be unfenced at startup + ipc: Raise the default buffer size to 128k + PE: Add a special attribute for distinguishing between real nodes and containers in constraint rules + PE: Allow location constraints to take a regex pattern to match against resource IDs + pengine: Distinguish between the agent being missing and something the agent needs being missing + remote: Properly version the remote connection protocol + services: Detect missing agents and permission errors before forking + Bug cl#5171 - pengine: Don't prevent clones from running due to dependant resources + Bug cl#5179 - Corosync: Attempt to retrieve a peer's node name if it is not already known + Bug cl#5181 - corosync: Ensure node IDs are written to the CIB as unsigned integers If you are a user of `pacemaker_remoted`, you should take the time to read about changes to the online wire protocol[2] that are present in this release. 
[1] https://github.com/ClusterLabs/pacemaker/releases/Pacemaker-1.1.11-rc1 [2] http://blog.clusterlabs.org/blog/2013/changes-to-the-remote-wire-protocol/ To build `rpm` packages for testing: 1. Clone the current sources: # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git # cd pacemaker 1. If you haven't already, install Pacemaker's dependancies [Fedora] # sudo yum install -y yum-utils [ALL] # make rpm-dep 1. Build Pacemaker # make rc 1. Copy the rpms and deploy as needed A new release candidate, Pacemaker-1.1.11-rc2, is ready for testing. https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.11-rc2 Assuming no major regressions are encountered during testing, this tag will become the final Pacemaker
Re: [Pacemaker] Time to get ready for 1.1.11
- Original Message - From: David Vossel dvos...@redhat.com To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Wednesday, January 15, 2014 5:16:40 PM Subject: Re: [Pacemaker] Time to get ready for 1.1.11 - Original Message - From: David Vossel dvos...@redhat.com To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Tuesday, January 7, 2014 4:50:11 PM Subject: Re: [Pacemaker] Time to get ready for 1.1.11 - Original Message - From: Andrew Beekhof and...@beekhof.net To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Thursday, December 19, 2013 2:25:00 PM Subject: Re: [Pacemaker] Time to get ready for 1.1.11 On 20 Dec 2013, at 2:11 am, Andrew Martin amar...@xes-inc.com wrote: David/Andrew, Once 1.1.11 final is released, is it considered the new stable series of Pacemaker, yes or should 1.1.10 still be used in very stable/critical production environments? Thanks, Andrew - Original Message - From: David Vossel dvos...@redhat.com To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Wednesday, December 11, 2013 3:33:46 PM Subject: Re: [Pacemaker] Time to get ready for 1.1.11 - Original Message - From: Andrew Beekhof and...@beekhof.net To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Wednesday, November 20, 2013 9:02:40 PM Subject: [Pacemaker] Time to get ready for 1.1.11 With over 400 updates since the release of 1.1.10, its time to start thinking about a new release. Today I have tagged release candidate 1[1]. The most notable fixes include: + attrd: Implementation of a truely atomic attrd for use with corosync 2.x + cib: Allow values to be added/updated and removed in a single update + cib: Support XML comments in diffs + Core: Allow blackbox logging to be disabled with SIGUSR2 + crmd: Do not block on proxied calls from pacemaker_remoted + crmd: Enable cluster-wide throttling when the cib heavily exceeds its target load + crmd: Use the load on our peers to know how many jobs to send them + crm_mon: add --hide-headers option to hide all headers + crm_report: Collect logs directly from journald if available + Fencing: On timeout, clean up the agent's entire process group + Fencing: Support agents that need the host to be unfenced at startup + ipc: Raise the default buffer size to 128k + PE: Add a special attribute for distinguishing between real nodes and containers in constraint rules + PE: Allow location constraints to take a regex pattern to match against resource IDs + pengine: Distinguish between the agent being missing and something the agent needs being missing + remote: Properly version the remote connection protocol + services: Detect missing agents and permission errors before forking + Bug cl#5171 - pengine: Don't prevent clones from running due to dependant resources + Bug cl#5179 - Corosync: Attempt to retrieve a peer's node name if it is not already known + Bug cl#5181 - corosync: Ensure node IDs are written to the CIB as unsigned integers If you are a user of `pacemaker_remoted`, you should take the time to read about changes to the online wire protocol[2] that are present in this release. [1] https://github.com/ClusterLabs/pacemaker/releases/Pacemaker-1.1.11-rc1 [2] http://blog.clusterlabs.org/blog/2013/changes-to-the-remote-wire-protocol/ To build `rpm` packages for testing: 1. Clone the current sources: # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git # cd pacemaker 1. 
If you haven't already, install Pacemaker's dependancies [Fedora] # sudo yum install -y yum-utils [ALL] # make rpm-dep 1. Build Pacemaker # make rc 1. Copy the rpms and deploy as needed A new release candidate, Pacemaker-1.1.11-rc2, is ready for testing. https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.11-rc2 Assuming no major regressions are encountered during testing, this tag will become the final Pacemaker-1.1.11 release a week from today. -- Vossel Alright, New RC time. Pacemaker-1.1.11-rc3. If no regressions are encountered, rc3 will become the 1.1.11 final release a week from today. https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.11-rc3 CHANGES RC2 vs RC3 Fix: ipc: fix memory leak for failed ipc client connections. Fix: pengine: Fixes memory leak in regex pattern matching code
Re: [Pacemaker] Preventing Automatic Failback
- Original Message - From: Michael Monette mmone...@2keys.ca To: Michael Monette mmone...@2keys.ca, The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Wednesday, January 22, 2014 12:08:09 AM Subject: Re: [Pacemaker] Preventing Automatic Failback This is the last time I'll update this thread. I made some guesses in my last one but everything is clear now. I am still learning lots. I had two problems; I thought they were related but they were not. The DRBD problem was that I had the wfc-timeout value set to 30 in drbd.conf while the Pacemaker default is 20 seconds. The second problem was that I was missing one of the requirements of a compatible LSB script: there was no status option. So I made one using some if statements, grepping for the process and returning 0 if it is running, 3 if not (whatever.. I'm just experimenting for now). After modifying that script, and raising the DRBD start timeout to 120 in Pacemaker, everything is working perfectly. Hope this helps someone down the road, thanks for your help Vossel. Mike. Great! Sounds like you worked it out :) -- Vossel ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
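For anyone hitting the same LSB-compliance gap, a minimal sketch of the kind of status handler Mike describes (process name and return codes are illustrative; this is not his actual script) might look like:

  status() {
      if pgrep -f jira >/dev/null 2>&1; then
          return 0    # LSB: program is running
      else
          return 3    # LSB: program is not running
      fi
  }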
Re: [Pacemaker] Preventing Automatic Failback
- Original Message - From: Michael Monette mmone...@2keys.ca To: pacemaker@oss.clusterlabs.org Sent: Monday, January 20, 2014 8:22:25 AM Subject: [Pacemaker] Preventing Automatic Failback Hi, I posted this question before but my question was a bit unclear. I have 2 nodes with DRBD with Postgresql. When node-1 fails, everything fails to node-2 . But when node 1 is recovered, things try to failback to node-1 and all the services running on node-2 get disrupted(things don't ACTUALLY fail back to node-1..they try, fail, and then all services on node-2 are simply restarted..very annoying). This does not happen if I perform the same tests on node-2! I can reboot node-2, things fail to node-1 and node-2 comes online and waits until he is needed(this is what I want!) It seems to only affect my node-1's. I have tried to set resource stickiness, I have tried everything I can really think of, but whenever the Primary has recovered, it will always disrupt services running on node-2. Also I tried removing things from this config to try and isolate this. At one point I removed the atlassian_jira and drbd2_var primitives and only had a failover-ip and drbd1_opt, but still had the same problem. Hopefully someone can pinpoint this out for me. If I can't really avoid this, I would at least like to make this bug or whatever happen on node-2 instead of the actives. I bet this is due to the drbd resource's master score value on node1 being higher than node2. When you recover node1, are you actually rebooting that node? If node1 doesn't lose membership from the cluster (reboot), those transient attributes that the drbd agent uses to specify which node will be the master instance will stick around. Otherwise if you are just putting node1 in standby and then bringing the node back online, the I believe the resources will come back if the drbd master was originally on node1. If you provide a policy engine file that shows the unwanted transition from node2 back to node1, we'll be able to tell you exactly why it is occurring. 
-- Vossel Here is my config: node node-1.comp.com \ attributes standby=off node node-1.comp.com \ attributes standby=off primitive atlassian_jira lsb:jira \ op start interval=0 timeout=240 \ op stop interval=0 timeout=240 primitive drbd1_opt ocf:heartbeat:Filesystem \ params device=/dev/drbd1 directory=/opt/atlassian fstype=ext4 primitive drbd2_var ocf:heartbeat:Filesystem \ params device=/dev/drbd2 directory=/var/atlassian fstype=ext4 primitive drbd_data ocf:linbit:drbd \ params drbd_resource=r0 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive failover-ip ocf:heartbeat:IPaddr2 \ params ip=10.199.0.13 group jira_services drbd1_opt drbd2_var failover-ip atlassian_jira ms ms_drbd_data drbd_data \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true colocation jira_services_on_drbd inf: atlassian_jira ms_drbd_data:Master order jira_services_after_drbd inf: ms_drbd_data:promote jira_services:start property $id=cib-bootstrap-options \ dc-version=1.1.10-14.el6_5.1-368c726 \ cluster-infrastructure=classic openais (with plugin) \ expected-quorum-votes=2 \ stonith-enabled=false \ no-quorum-policy=ignore \ last-lrm-refresh=1390183165 \ default-resource-stickiness=INFINITY rsc_defaults $id=rsc-options \ resource-stickiness=INFINITY Thanks Mike ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
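If it is the DRBD master score driving this, you can watch the transient attribute directly on each node before and after recovering node-1 (a hedged example: the attribute name is assumed to follow the usual crm_master convention of master-<resource>, here master-drbd_data, and may include the clone instance number on some versions; node names come from the config above):

  crm_attribute -N node-1.comp.com -l reboot -n master-drbd_data -G
  crm_attribute -N node-2.comp.com -l reboot -n master-drbd_data -G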
Re: [Pacemaker] Resource Status from the CIB
- Original Message - From: Michael Schwartzkopff m...@sys4.de To: pacemaker@oss.clusterlabs.org Sent: Monday, January 20, 2014 8:23:25 AM Subject: [Pacemaker] Resource Status from the CIB Hi, is it possible to read the status of a resource directly from the status part of the CIB? I can see an attribute rc-code=0 in lrm_rsc_op_resourceID... on an active node, and an rc-code=7 in lrm_rsc_op on a node where the resource is stopped. Is this the correct way to check it? Is there any documentation? Thanks. Yes, you can read the status if you want. It would be better to use crm_mon to interpret the status for you though. Here's some info on the status section: http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_operation_history -- Vossel Kind regards, Michael Schwartzkopff -- [*] sys4 AG http://sys4.de, +49 (89) 30 90 46 64, +49 (162) 165 0044 Franziskanerstraße 15, 81669 München Registered office: München, Amtsgericht München: HRB 199263 Management board: Patrick Ben Koetter, Marc Schiffbauer Chairman of the supervisory board: Florian Kirstein ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
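If you do want to poke at the status section directly (rather than letting crm_mon interpret it for you), something along these lines should work; the resource and node names are placeholders:

  # let pacemaker interpret the status section and tell you where the resource runs
  crm_resource --resource myrsc --locate
  # or pull the raw operation history for one node out of the live CIB
  cibadmin --query --xpath "//node_state[@uname='node1']//lrm_resource[@id='myrsc']"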
Re: [Pacemaker] Question about new migration
- Original Message - From: Andrew Beekhof and...@beekhof.net To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Wednesday, January 15, 2014 3:42:23 AM Subject: Re: [Pacemaker] Question about new migration On 15 Jan 2014, at 7:12 pm, Kazunori INOUE kazunori.ino...@gmail.com wrote: Hi David, With new migration logic, when VM was migrated by 'node standby', start was performed in migrate_target. (migrate_from was not performed.) Is this the designed behavior? no, this is a bug. In this instance the partial migration should have continued regardless if the transition was aborted or not. I don't think this is a new bug though, I think this existed in the previous migration logic as well. I think I understand what is going on though. I'll make a patch -- Vossel # crm_mon -rf1 Stack: corosync Current DC: bl460g1n6 (3232261592) - partition with quorum Version: 1.1.11-0.27.b48276b.git.el6-b48276b 2 Nodes configured 3 Resources configured Online: [ bl460g1n6 bl460g1n7 ] Full list of resources: prmVM2 (ocf::heartbeat:VirtualDomain): Started bl460g1n6 Clone Set: clnPing [prmPing] Started: [ bl460g1n6 bl460g1n7 ] Node Attributes: * Node bl460g1n6: + default_ping_set : 100 * Node bl460g1n7: + default_ping_set : 100 # crm node standby bl460g1n6 # egrep do_lrm_rsc_op:|process_lrm_event: ha-log | grep prmVM2 Jan 15 15:39:22 bl460g1n6 crmd[30795]: info: do_lrm_rsc_op: Performing key=11:5:0:be72ea63-75a9-4de4-a591-e716f960743b op=prmVM2_migrate_to_0 Jan 15 15:39:28 bl460g1n6 crmd[30795]: notice: process_lrm_event: LRM operation prmVM2_migrate_to_0 (call=16, rc=0, cib-update=66, confirmed=true) ok Jan 15 15:39:30 bl460g1n6 crmd[30795]: info: do_lrm_rsc_op: Performing key=7:6:0:be72ea63-75a9-4de4-a591-e716f960743b op=prmVM2_stop_0 Looks like the transition was aborted (5) and another (6) calculated. Compare action:transition:expected_rc:uuid key=11:5:0:be72ea63-75a9-4de4-a591-e716f960743b and key=7:6:0:be72ea63-75a9-4de4-a591-e716f960743b Jan 15 15:39:30 bl460g1n6 crmd[30795]: notice: process_lrm_event: LRM operation prmVM2_stop_0 (call=19, rc=0, cib-update=68, confirmed=true) ok Jan 15 15:39:30 bl460g1n7 crmd[29923]: info: do_lrm_rsc_op: Performing key=8:6:0:be72ea63-75a9-4de4-a591-e716f960743b op=prmVM2_start_0 Jan 15 15:39:30 bl460g1n7 crmd[29923]: notice: process_lrm_event: LRM operation prmVM2_start_0 (call=13, rc=0, cib-update=17, confirmed=true) ok Best Regards, Kazunori INOUE pcmk-Wed-15-Jan-2014.tar.bz2___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Question about new migration
- Original Message - From: David Vossel dvos...@redhat.com To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Wednesday, January 15, 2014 10:27:49 AM Subject: Re: [Pacemaker] Question about new migration - Original Message - From: Andrew Beekhof and...@beekhof.net To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Wednesday, January 15, 2014 3:42:23 AM Subject: Re: [Pacemaker] Question about new migration On 15 Jan 2014, at 7:12 pm, Kazunori INOUE kazunori.ino...@gmail.com wrote: Hi David, With new migration logic, when VM was migrated by 'node standby', start was performed in migrate_target. (migrate_from was not performed.) Is this the designed behavior? no, this is a bug. In this instance the partial migration should have continued regardless if the transition was aborted or not. I don't think this is a new bug though, I think this existed in the previous migration logic as well. I think I understand what is going on though. I'll make a patch This fixes the problem https://github.com/ClusterLabs/pacemaker/commit/b578680e4a16d915c130d5928cf9d9af296f2414 Thanks for testing the new migration logic out :D -- Vossel # crm_mon -rf1 Stack: corosync Current DC: bl460g1n6 (3232261592) - partition with quorum Version: 1.1.11-0.27.b48276b.git.el6-b48276b 2 Nodes configured 3 Resources configured Online: [ bl460g1n6 bl460g1n7 ] Full list of resources: prmVM2 (ocf::heartbeat:VirtualDomain): Started bl460g1n6 Clone Set: clnPing [prmPing] Started: [ bl460g1n6 bl460g1n7 ] Node Attributes: * Node bl460g1n6: + default_ping_set : 100 * Node bl460g1n7: + default_ping_set : 100 # crm node standby bl460g1n6 # egrep do_lrm_rsc_op:|process_lrm_event: ha-log | grep prmVM2 Jan 15 15:39:22 bl460g1n6 crmd[30795]: info: do_lrm_rsc_op: Performing key=11:5:0:be72ea63-75a9-4de4-a591-e716f960743b op=prmVM2_migrate_to_0 Jan 15 15:39:28 bl460g1n6 crmd[30795]: notice: process_lrm_event: LRM operation prmVM2_migrate_to_0 (call=16, rc=0, cib-update=66, confirmed=true) ok Jan 15 15:39:30 bl460g1n6 crmd[30795]: info: do_lrm_rsc_op: Performing key=7:6:0:be72ea63-75a9-4de4-a591-e716f960743b op=prmVM2_stop_0 Looks like the transition was aborted (5) and another (6) calculated. 
Compare action:transition:expected_rc:uuid key=11:5:0:be72ea63-75a9-4de4-a591-e716f960743b and key=7:6:0:be72ea63-75a9-4de4-a591-e716f960743b Jan 15 15:39:30 bl460g1n6 crmd[30795]: notice: process_lrm_event: LRM operation prmVM2_stop_0 (call=19, rc=0, cib-update=68, confirmed=true) ok Jan 15 15:39:30 bl460g1n7 crmd[29923]: info: do_lrm_rsc_op: Performing key=8:6:0:be72ea63-75a9-4de4-a591-e716f960743b op=prmVM2_start_0 Jan 15 15:39:30 bl460g1n7 crmd[29923]: notice: process_lrm_event: LRM operation prmVM2_start_0 (call=13, rc=0, cib-update=17, confirmed=true) ok Best Regards, Kazunori INOUE pcmk-Wed-15-Jan-2014.tar.bz2___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Time to get ready for 1.1.11
- Original Message - From: David Vossel dvos...@redhat.com To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Tuesday, January 7, 2014 4:50:11 PM Subject: Re: [Pacemaker] Time to get ready for 1.1.11 - Original Message - From: Andrew Beekhof and...@beekhof.net To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Thursday, December 19, 2013 2:25:00 PM Subject: Re: [Pacemaker] Time to get ready for 1.1.11 On 20 Dec 2013, at 2:11 am, Andrew Martin amar...@xes-inc.com wrote: David/Andrew, Once 1.1.11 final is released, is it considered the new stable series of Pacemaker, yes or should 1.1.10 still be used in very stable/critical production environments? Thanks, Andrew - Original Message - From: David Vossel dvos...@redhat.com To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Wednesday, December 11, 2013 3:33:46 PM Subject: Re: [Pacemaker] Time to get ready for 1.1.11 - Original Message - From: Andrew Beekhof and...@beekhof.net To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Wednesday, November 20, 2013 9:02:40 PM Subject: [Pacemaker] Time to get ready for 1.1.11 With over 400 updates since the release of 1.1.10, its time to start thinking about a new release. Today I have tagged release candidate 1[1]. The most notable fixes include: + attrd: Implementation of a truely atomic attrd for use with corosync 2.x + cib: Allow values to be added/updated and removed in a single update + cib: Support XML comments in diffs + Core: Allow blackbox logging to be disabled with SIGUSR2 + crmd: Do not block on proxied calls from pacemaker_remoted + crmd: Enable cluster-wide throttling when the cib heavily exceeds its target load + crmd: Use the load on our peers to know how many jobs to send them + crm_mon: add --hide-headers option to hide all headers + crm_report: Collect logs directly from journald if available + Fencing: On timeout, clean up the agent's entire process group + Fencing: Support agents that need the host to be unfenced at startup + ipc: Raise the default buffer size to 128k + PE: Add a special attribute for distinguishing between real nodes and containers in constraint rules + PE: Allow location constraints to take a regex pattern to match against resource IDs + pengine: Distinguish between the agent being missing and something the agent needs being missing + remote: Properly version the remote connection protocol + services: Detect missing agents and permission errors before forking + Bug cl#5171 - pengine: Don't prevent clones from running due to dependant resources + Bug cl#5179 - Corosync: Attempt to retrieve a peer's node name if it is not already known + Bug cl#5181 - corosync: Ensure node IDs are written to the CIB as unsigned integers If you are a user of `pacemaker_remoted`, you should take the time to read about changes to the online wire protocol[2] that are present in this release. [1] https://github.com/ClusterLabs/pacemaker/releases/Pacemaker-1.1.11-rc1 [2] http://blog.clusterlabs.org/blog/2013/changes-to-the-remote-wire-protocol/ To build `rpm` packages for testing: 1. Clone the current sources: # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git # cd pacemaker 1. If you haven't already, install Pacemaker's dependancies [Fedora] # sudo yum install -y yum-utils [ALL] # make rpm-dep 1. Build Pacemaker # make rc 1. Copy the rpms and deploy as needed A new release candidate, Pacemaker-1.1.11-rc2, is ready for testing. 
https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.11-rc2 Assuming no major regressions are encountered during testing, this tag will become the final Pacemaker-1.1.11 release a week from today. -- Vossel Alright, New RC time. Pacemaker-1.1.11-rc3. If no regressions are encountered, rc3 will become the 1.1.11 final release a week from today. https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.11-rc3 CHANGES RC2 vs RC3 Fix: ipc: fix memory leak for failed ipc client connections. Fix: pengine: Fixes memory leak in regex pattern matching code for constraints. Low: Avoid potentially misleading and inaccurate compression time log msg Fix: crm_report: Suppress logging errors after the target directory has been compressed Fix: crm_attribute: Do not swallow hostname lookup failures Fix: crmd: Avoid deleting the 'shutdown' attribute Log: attrd: Quote attribute names Doc: Pacemaker_Explained: Fix formatting A new release candidate for pacemaker 1.1.11
Re: [Pacemaker] lrmd segfault at pacemaker 1.1.11-rc1
- Original Message - From: Kazunori INOUE kazunori.ino...@gmail.com To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Friday, January 10, 2014 5:23:04 AM Subject: Re: [Pacemaker] lrmd segfault at pacemaker 1.1.11-rc1 2014/1/9 Andrew Beekhof and...@beekhof.net: On 8 Jan 2014, at 9:15 pm, Kazunori INOUE kazunori.ino...@gmail.com wrote: 2014/1/8 Andrew Beekhof and...@beekhof.net: On 18 Dec 2013, at 9:50 pm, Kazunori INOUE kazunori.ino...@gmail.com wrote: Hi David, 2013/12/18 David Vossel dvos...@redhat.com: That's a really weird one... I don't see how it is possible for op->id to be NULL there. You might need to give valgrind a shot to detect whatever is really going on here. -- Vossel Thank you for the advice. I'll try it. Any update on this? We are still investigating the cause. It was not reproduced when I ran under valgrind, and it was reproduced in RC3. So it happened with RC3 - valgrind, but not RC3 + valgrind? That's concerning. Nothing in the valgrind output? The cause was found.

230 gboolean
231 operation_finalize(svc_action_t * op)
232 {
233     int recurring = 0;
234
235     if (op->interval) {
236         if (op->cancel) {
237             op->status = PCMK_LRM_OP_CANCELLED;
238             cancel_recurring_action(op);
239         } else {
240             recurring = 1;
241             op->opaque->repeat_timer = g_timeout_add(op->interval,
242                                                      recurring_action_timer, (void *)op);
243         }
244     }
245
246     if (op->opaque->callback) {
247         op->opaque->callback(op);
248     }
249
250     op->pid = 0;
251
252     if (!recurring) {
253         /*
254          * If this is a recurring action, do not free explicitly.
255          * It will get freed whenever the action gets cancelled.
256          */
257         services_action_free(op);
258         return TRUE;
259     }
260     return FALSE;
261 }

When op->id is not 0, op is not removed from the hash table in the cancel_recurring_action function (l.238). However, op is freed in the services_action_free function (l.257). That is, the freed data remains in the hash table, so a later g_hash_table_lookup may return the freed data. Therefore, I added a change so that the g_hash_table_remove function is always called when g_hash_table_replace is called (in the services_action_async function). As of now, the segfault has not happened. Awesome, thanks for tracking this down. I created a modified version of your patch and put it up for review as a pacemaker pull request. https://github.com/ClusterLabs/pacemaker/pull/408 -- Vossel ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Time to get ready for 1.1.11
- Original Message - From: Andrew Beekhof and...@beekhof.net To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Thursday, December 19, 2013 2:25:00 PM Subject: Re: [Pacemaker] Time to get ready for 1.1.11 On 20 Dec 2013, at 2:11 am, Andrew Martin amar...@xes-inc.com wrote: David/Andrew, Once 1.1.11 final is released, is it considered the new stable series of Pacemaker, yes or should 1.1.10 still be used in very stable/critical production environments? Thanks, Andrew - Original Message - From: David Vossel dvos...@redhat.com To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Wednesday, December 11, 2013 3:33:46 PM Subject: Re: [Pacemaker] Time to get ready for 1.1.11 - Original Message - From: Andrew Beekhof and...@beekhof.net To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Wednesday, November 20, 2013 9:02:40 PM Subject: [Pacemaker] Time to get ready for 1.1.11 With over 400 updates since the release of 1.1.10, its time to start thinking about a new release. Today I have tagged release candidate 1[1]. The most notable fixes include: + attrd: Implementation of a truely atomic attrd for use with corosync 2.x + cib: Allow values to be added/updated and removed in a single update + cib: Support XML comments in diffs + Core: Allow blackbox logging to be disabled with SIGUSR2 + crmd: Do not block on proxied calls from pacemaker_remoted + crmd: Enable cluster-wide throttling when the cib heavily exceeds its target load + crmd: Use the load on our peers to know how many jobs to send them + crm_mon: add --hide-headers option to hide all headers + crm_report: Collect logs directly from journald if available + Fencing: On timeout, clean up the agent's entire process group + Fencing: Support agents that need the host to be unfenced at startup + ipc: Raise the default buffer size to 128k + PE: Add a special attribute for distinguishing between real nodes and containers in constraint rules + PE: Allow location constraints to take a regex pattern to match against resource IDs + pengine: Distinguish between the agent being missing and something the agent needs being missing + remote: Properly version the remote connection protocol + services: Detect missing agents and permission errors before forking + Bug cl#5171 - pengine: Don't prevent clones from running due to dependant resources + Bug cl#5179 - Corosync: Attempt to retrieve a peer's node name if it is not already known + Bug cl#5181 - corosync: Ensure node IDs are written to the CIB as unsigned integers If you are a user of `pacemaker_remoted`, you should take the time to read about changes to the online wire protocol[2] that are present in this release. [1] https://github.com/ClusterLabs/pacemaker/releases/Pacemaker-1.1.11-rc1 [2] http://blog.clusterlabs.org/blog/2013/changes-to-the-remote-wire-protocol/ To build `rpm` packages for testing: 1. Clone the current sources: # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git # cd pacemaker 1. If you haven't already, install Pacemaker's dependancies [Fedora] # sudo yum install -y yum-utils [ALL] # make rpm-dep 1. Build Pacemaker # make rc 1. Copy the rpms and deploy as needed A new release candidate, Pacemaker-1.1.11-rc2, is ready for testing. https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.11-rc2 Assuming no major regressions are encountered during testing, this tag will become the final Pacemaker-1.1.11 release a week from today. -- Vossel Alright, New RC time. Pacemaker-1.1.11-rc3. 
If no regressions are encountered, rc3 will become the 1.1.11 final release a week from today. https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.11-rc3 CHANGES RC2 vs RC3 Fix: ipc: fix memory leak for failed ipc client connections. Fix: pengine: Fixes memory leak in regex pattern matching code for constraints. Low: Avoid potentially misleading and inaccurate compression time log msg Fix: crm_report: Suppress logging errors after the target directory has been compressed Fix: crm_attribute: Do not swallow hostname lookup failures Fix: crmd: Avoid deleting the 'shutdown' attribute Log: attrd: Quote attribute names Doc: Pacemaker_Explained: Fix formatting -- Vossel ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] pacemaker 1.1.10 + RHEL 7 beta issues
- Original Message - From: Digimer li...@alteeve.ca To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Wednesday, January 1, 2014 5:38:47 PM Subject: Re: [Pacemaker] pacemaker 1.1.10 + RHEL 7 beta issues Is this a bug? There's too much going on here for me to tell. Can you provide a crm_report that contains the pengine files before and after that constraint gets deleted allowing apache to start again? -- Vossel -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
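A hedged example of what gathering that data usually looks like in practice (adjust the time window so it brackets the moment the constraint was deleted; the destination name is arbitrary):

  crm_report -f "2014-01-01 17:00" -t "2014-01-01 18:00" /tmp/apache-constraint-report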
Re: [Pacemaker] Forming single node cluster?
- Original Message - From: Digimer li...@alteeve.ca To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Thursday, January 2, 2014 2:03:26 PM Subject: Re: [Pacemaker] Forming single node cluster? On 02/01/14 02:36 PM, John Wei wrote: Any way to form a single node cluster? I am evaluating pacemaker. Hope I can do this with just single server. John I'm not sure how much you can evaluate with just one node, but technically, I see no reason why you couldn't. You would have to disable both quorum and stonith, and there would be no resource recovery or migration of course. These pcs commands will accomplish this. pcs property set stonith-enabled=false pcs property set no-quorum-policy=ignore -- Vossel -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
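As a rough illustration of the suggestion above, here is a minimal single-node bring-up using pcs. The cluster name and node name are placeholders, and the exact `pcs cluster setup` syntax varies a little between pcs versions, so treat this as a sketch rather than a recipe:

  # pcs cluster auth node1
  # pcs cluster setup --name testcluster node1
  # pcs cluster start
  # pcs property set stonith-enabled=false
  # pcs property set no-quorum-policy=ignore
  # pcs status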
Re: [Pacemaker] Problem with stonith in rhel7 + pacemaker 1.1.10 + fence_virsh
- Original Message - From: Digimer li...@alteeve.ca To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Saturday, December 21, 2013 2:39:46 PM Subject: [Pacemaker] Problem with stonith in rhel7 + pacemaker 1.1.10 + fence_virsh Hi all, I'm trying to learn pacemaker (still) using a pair of RHEL 7 beta VMs. I've got stonith configured and it technically works (crashed node reboots), but pacemaker hangs... Here is the config: Cluster Name: rhel7-pcmk Corosync Nodes: rhel7-01.alteeve.ca rhel7-02.alteeve.ca Pacemaker Nodes: rhel7-01.alteeve.ca rhel7-02.alteeve.ca Resources: Stonith Devices: Resource: fence_n01_virsh (class=stonith type=fence_virsh) Attributes: pcmk_host_list=rhel7-01 ipaddr=lemass action=reboot login=root passwd_script=/root/lemass.pw delay=15 port=rhel7_01 Operations: monitor interval=60s (fence_n01_virsh-monitor-interval-60s) Resource: fence_n02_virsh (class=stonith type=fence_virsh) Attributes: pcmk_host_list=rhel7-02 ipaddr=lemass action=reboot When using fence_virt, the easiest way to configure everything is to name the actual virtual machines the same as what their corosync node names are going to be. If you run this command in a virtual machine, you can see the names fence_virt thinks all the nodes are. fence_xvm -o list node1 c4dbe904-f51a-d53f-7ef0-2b03361c6401 on node2 c4dbe904-f51a-d53f-7ef0-2b03361c6402 on node3 c4dbe904-f51a-d53f-7ef0-2b03361c6403 on If you name the vm the same as the node name, you don't even have to list the static host list. Stonith will do all that magic behind the scenes. If the node names do not match, try the 'pcmk_host_map' option. I believe you should be able to map the corosync node name to the vm's name using that option. Hope that helps :) -- Vossel login=root passwd_script=/root/lemass.pw port=rhel7_02 Operations: monitor interval=60s (fence_n02_virsh-monitor-interval-60s) Fencing Levels: Location Constraints: Ordering Constraints: Colocation Constraints: Cluster Properties: cluster-infrastructure: corosync dc-version: 1.1.10-19.el7-368c726 no-quorum-policy: ignore stonith-enabled: true Here are the logs: Dec 21 14:36:07 rhel7-01 corosync[1709]: [TOTEM ] A processor failed, forming new configuration. Dec 21 14:36:09 rhel7-01 corosync[1709]: [TOTEM ] A new membership (192.168.122.101:24) was formed. Members left: 2 Dec 21 14:36:09 rhel7-01 corosync[1709]: [QUORUM] Members[1]: 1 Dec 21 14:36:09 rhel7-01 corosync[1709]: [MAIN ] Completed service synchronization, ready to provide service. 
Dec 21 14:36:09 rhel7-01 crmd[1730]: notice: crm_update_peer_state: pcmk_quorum_notification: Node rhel7-02.alteeve.ca[2] - state is now lost (was member) Dec 21 14:36:09 rhel7-01 crmd[1730]: warning: reap_dead_nodes: Our DC node (rhel7-02.alteeve.ca) left the cluster Dec 21 14:36:09 rhel7-01 crmd[1730]: notice: do_state_transition: State transition S_NOT_DC - S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=reap_dead_nodes ] Dec 21 14:36:09 rhel7-01 pacemakerd[1724]: notice: crm_update_peer_state: pcmk_quorum_notification: Node rhel7-02.alteeve.ca[2] - state is now lost (was member) Dec 21 14:36:09 rhel7-01 crmd[1730]: notice: do_state_transition: State transition S_ELECTION - S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ] Dec 21 14:36:10 rhel7-01 attrd[1728]: notice: attrd_local_callback: Sending full refresh (origin=crmd) Dec 21 14:36:10 rhel7-01 attrd[1728]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true) Dec 21 14:36:11 rhel7-01 pengine[1729]: notice: unpack_config: On loss of CCM Quorum: Ignore Dec 21 14:36:11 rhel7-01 pengine[1729]: warning: pe_fence_node: Node rhel7-02.alteeve.ca will be fenced because fence_n02_virsh is thought to be active there Dec 21 14:36:11 rhel7-01 pengine[1729]: warning: custom_action: Action fence_n02_virsh_stop_0 on rhel7-02.alteeve.ca is unrunnable (offline) Dec 21 14:36:11 rhel7-01 pengine[1729]: warning: stage6: Scheduling Node rhel7-02.alteeve.ca for STONITH Dec 21 14:36:11 rhel7-01 pengine[1729]: notice: LogActions: Move fence_n02_virsh (Started rhel7-02.alteeve.ca - rhel7-01.alteeve.ca) Dec 21 14:36:11 rhel7-01 pengine[1729]: warning: process_pe_message: Calculated Transition 0: /var/lib/pacemaker/pengine/pe-warn-2.bz2 Dec 21 14:36:11 rhel7-01 crmd[1730]: notice: te_fence_node: Executing reboot fencing operation (11) on rhel7-02.alteeve.ca (timeout=6) Dec 21 14:36:11 rhel7-01 stonith-ng[1726]: notice: handle_request: Client crmd.1730.4f6ea9e1 wants to fence (reboot) 'rhel7-02.alteeve.ca' with device '(any)' Dec 21 14:36:11 rhel7-01 stonith-ng[1726]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for rhel7-02.alteeve.ca: ea720bbf-aeab-43bb-a196-3a4c091dea75 (0) Dec 21 14:36:11 rhel7-01
Re: [Pacemaker] Problem with stonith in rhel7 + pacemaker 1.1.10 + fence_virsh
- Original Message - From: Digimer li...@alteeve.ca To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Monday, December 23, 2013 12:42:23 PM Subject: Re: [Pacemaker] Problem with stonith in rhel7 + pacemaker 1.1.10 + fence_virsh On 23/12/13 01:30 PM, David Vossel wrote: - Original Message - From: Digimer li...@alteeve.ca To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Saturday, December 21, 2013 2:39:46 PM Subject: [Pacemaker] Problem with stonith in rhel7 + pacemaker 1.1.10 + fence_virsh Hi all, I'm trying to learn pacemaker (still) using a pair of RHEL 7 beta VMs. I've got stonith configured and it technically works (crashed node reboots), but pacemaker hangs... Here is the config: Cluster Name: rhel7-pcmk Corosync Nodes: rhel7-01.alteeve.ca rhel7-02.alteeve.ca Pacemaker Nodes: rhel7-01.alteeve.ca rhel7-02.alteeve.ca Resources: Stonith Devices: Resource: fence_n01_virsh (class=stonith type=fence_virsh) Attributes: pcmk_host_list=rhel7-01 ipaddr=lemass action=reboot login=root passwd_script=/root/lemass.pw delay=15 port=rhel7_01 Operations: monitor interval=60s (fence_n01_virsh-monitor-interval-60s) Resource: fence_n02_virsh (class=stonith type=fence_virsh) Attributes: pcmk_host_list=rhel7-02 ipaddr=lemass action=reboot When using fence_virt, the easiest way to configure everything is to name the actual virtual machines the same as what their corosync node names are going to be. If you run this command in a virtual machine, you can see the names fence_virt thinks all the nodes are. fence_xvm -o list node1 c4dbe904-f51a-d53f-7ef0-2b03361c6401 on node2 c4dbe904-f51a-d53f-7ef0-2b03361c6402 on node3 c4dbe904-f51a-d53f-7ef0-2b03361c6403 on If you name the vm the same as the node name, you don't even have to list the static host list. Stonith will do all that magic behind the scenes. If the node names do not match, try the 'pcmk_host_map' option. I believe you should be able to map the corosync node name to the vm's name using that option. Hope that helps :) -- Vossel Hi David, I'm using fence_virsh, ah sorry, missed that. not fence_virtd/fence_xvm. For reasons I've not been able to resolve, fence_xvm has been unreliable on Fedora for some time now. the multicast bug :( -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
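To make the pcmk_host_map suggestion above concrete: assuming the corosync node name is rhel7-01.alteeve.ca and the libvirt domain is named rhel7_01 (the values from the posted config), a fence_virsh device could be created along these lines. This is a hedged sketch, not a tested configuration:

  # pcs stonith create fence_n01_virsh fence_virsh \
        ipaddr=lemass login=root passwd_script=/root/lemass.pw \
        pcmk_host_map="rhel7-01.alteeve.ca:rhel7_01" \
        op monitor interval=60s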
Re: [Pacemaker] Typo? pcs cluster push question....
- Original Message - From: Steven Silk - NOAA Affiliate steven.s...@noaa.gov To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Thursday, December 19, 2013 11:54:36 AM Subject: [Pacemaker] Typo? pcs cluster push question I have been working from http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_example.html at the bottom of the page is : Now push the configuration into the cluster # pcs cluster push cib stonith_cfg This does not work - I believe the proper syntax is # pcs cluster cib-push stonith_cfg Or am I working with a different version of pcs? You are using a newer version of pcs than that document is based off of. We need to go through and update some command syntax to reflect recent changes to pcs. -- Vossel thanks! Steven Silk 303 497 3112 ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
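For reference, the workflow the corrected command belongs to looks like this with a current pcs: dump the live CIB to a file, make the changes against the file, then push it back. The stonith_cfg filename matches the documentation example:

  # pcs cluster cib stonith_cfg                 # save the live CIB to a file
  # pcs -f stonith_cfg stonith create ...       # edit the file copy, not the live CIB
  # pcs cluster cib-push stonith_cfg            # push the result into the cluster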
Re: [Pacemaker] Time to get ready for 1.1.11
- Original Message - From: David Vossel dvos...@redhat.com To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Wednesday, December 11, 2013 3:33:46 PM Subject: Re: [Pacemaker] Time to get ready for 1.1.11 - Original Message - From: Andrew Beekhof and...@beekhof.net To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Wednesday, November 20, 2013 9:02:40 PM Subject: [Pacemaker] Time to get ready for 1.1.11 With over 400 updates since the release of 1.1.10, its time to start thinking about a new release. Today I have tagged release candidate 1[1]. The most notable fixes include: + attrd: Implementation of a truely atomic attrd for use with corosync 2.x + cib: Allow values to be added/updated and removed in a single update + cib: Support XML comments in diffs + Core: Allow blackbox logging to be disabled with SIGUSR2 + crmd: Do not block on proxied calls from pacemaker_remoted + crmd: Enable cluster-wide throttling when the cib heavily exceeds its target load + crmd: Use the load on our peers to know how many jobs to send them + crm_mon: add --hide-headers option to hide all headers + crm_report: Collect logs directly from journald if available + Fencing: On timeout, clean up the agent's entire process group + Fencing: Support agents that need the host to be unfenced at startup + ipc: Raise the default buffer size to 128k + PE: Add a special attribute for distinguishing between real nodes and containers in constraint rules + PE: Allow location constraints to take a regex pattern to match against resource IDs + pengine: Distinguish between the agent being missing and something the agent needs being missing + remote: Properly version the remote connection protocol + services: Detect missing agents and permission errors before forking + Bug cl#5171 - pengine: Don't prevent clones from running due to dependant resources + Bug cl#5179 - Corosync: Attempt to retrieve a peer's node name if it is not already known + Bug cl#5181 - corosync: Ensure node IDs are written to the CIB as unsigned integers If you are a user of `pacemaker_remoted`, you should take the time to read about changes to the online wire protocol[2] that are present in this release. [1] https://github.com/ClusterLabs/pacemaker/releases/Pacemaker-1.1.11-rc1 [2] http://blog.clusterlabs.org/blog/2013/changes-to-the-remote-wire-protocol/ To build `rpm` packages for testing: 1. Clone the current sources: # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git # cd pacemaker 1. If you haven't already, install Pacemaker's dependancies [Fedora] # sudo yum install -y yum-utils [ALL]# make rpm-dep 1. Build Pacemaker # make rc 1. Copy the rpms and deploy as needed A new release candidate, Pacemaker-1.1.11-rc2, is ready for testing. https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.11-rc2 Assuming no major regressions are encountered during testing, this tag will become the final Pacemaker-1.1.11 release a week from today. I have found a compile time error with 1.1.11 on rhel6 based systems and there has been a lrmd crash reported this week. After these issues get resolved there will be an rc3. This will result in 1.1.11's release being pushed to after the new year. Thanks for all the continued help in shaping v1.1.11 into a solid release. 
We are getting very close :) -- Vossel ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] lrmd segfault at pacemaker 1.1.11-rc1
- Original Message - From: Kazunori INOUE kazunori.ino...@gmail.com To: pm pacemaker@oss.clusterlabs.org Sent: Tuesday, December 17, 2013 5:43:53 AM Subject: [Pacemaker] lrmd segfault at pacemaker 1.1.11-rc1 Hi, When repeated 'node standby' and 'node online', lrmd crashed with SIGSEGV because op-id in cancel_recurring_action() was NULL. That's a really weird one... I don't see how it is possible for op-id to be NULL there. You might need to give valgrind a shot to detect whatever is really going on here. -- Vossel Dec 17 19:01:21 vm3 crmd[2433]: info: do_state_transition: State transition S_POLICY_ENGINE - S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ] Dec 17 19:01:21 vm3 crmd[2433]: info: do_te_invoke: Processing graph 437 (ref=pe_calc-dc-1387274481-5672) derived from /var/lib/pacemaker/pengine/pe-input-437.bz2 Dec 17 19:01:21 vm3 crmd[2433]: notice: te_rsc_command: Initiating action 17: stop prmStonith4_stop_0 on vm3 (local) Dec 17 19:01:21 vm3 crmd[2433]: info: do_lrm_rsc_op: Performing key=17:437:0:40d7b9a2-c373-4459-a811-9c225d1a9555 op=prmStonith4_stop_0 Dec 17 19:01:21 vm3 lrmd[2430]: info: log_execute: executing - rsc:prmStonith4 action:stop call_id:3487 Dec 17 19:01:21 vm3 stonith-ng[2429]: info: stonith_command: Processed st_device_remove from lrmd.2430: OK (0) Dec 17 19:01:21 vm3 lrmd[2430]: info: log_finished: finished - rsc:prmStonith4 action:stop call_id:3487 exit-code:0 exec-time:0ms queue-time:0ms Dec 17 19:01:21 vm3 pengine[2432]: notice: process_pe_message: Calculated Transition 437: /var/lib/pacemaker/pengine/pe-input-437.bz2 Dec 17 19:01:21 vm3 crmd[2433]: notice: te_rsc_command: Initiating action 33: stop prmPg_stop_0 on vm3 (local) Dec 17 19:01:21 vm3 lrmd[2430]: info: cancel_recurring_action: Cancelling operation prmPg_monitor_1 Dec 17 19:01:21 vm3 crmd[2433]: info: do_lrm_rsc_op: Performing key=33:437:0:40d7b9a2-c373-4459-a811-9c225d1a9555 op=prmPg_stop_0 Dec 17 19:01:21 vm3 lrmd[2430]: info: log_execute: executing - rsc:prmPg action:stop call_id:3489 Dec 17 19:01:21 vm3 crmd[2433]: info: process_lrm_event: LRM operation prmStonith4_monitor_360 (call=3473, status=1, cib-update=0, confirmed=true) Cancelled Dec 17 19:01:21 vm3 crmd[2433]: notice: process_lrm_event: LRM operation prmStonith4_stop_0 (call=3487, rc=0, cib-update=3090, confirmed=true) ok Dec 17 19:01:21 vm3 crmd[2433]: info: process_lrm_event: LRM operation prmPg_monitor_1 (call=3485, status=1, cib-update=0, confirmed=true) Cancelled Dec 17 19:01:21 vm3 crmd[2433]: info: match_graph_event: Action prmStonith4_stop_0 (17) confirmed on vm3 (rc=0) Dec 17 19:01:21 vm3 crmd[2433]: notice: te_rsc_command: Initiating action 40: stop prmPing_stop_0 on vm3 (local) Dec 17 19:01:21 vm3 cib[2428]: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/crmd/3090, version=0.440.2) Dec 17 19:01:21 vm3 stonith-ng[2429]: info: crm_client_destroy: Destroying 0 events Dec 17 19:01:21 vm3 pacemakerd[2424]:error: child_death_dispatch: Managed process 2430 (lrmd) dumped core Dec 17 19:01:21 vm3 pacemakerd[2424]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=2430, core=1) Dec 17 19:01:21 vm3 pacemakerd[2424]: notice: pcmk_process_exit: Respawning failed child process: lrmd Dec 17 19:01:21 vm3 pacemakerd[2424]:error: pcmk_process_exit: Rebooting system Dec 17 19:10:40 vm3 root: Mark:pcmk:1387275040 $ gdb /usr/libexec/pacemaker/lrmd core.2430 (gdb) bt #0 0x00323f8480ac in vfprintf () from /lib64/libc.so.6 #1 
0x00323f86f9d2 in vsnprintf () from /lib64/libc.so.6 #2 0x003fcb81726d in qb_log_real_va_ (cs=0x3fcf208658, ap=0x76f5fc80) at log.c:230 #3 0x003fcb8173ea in qb_log_real_ (cs=0x3fcf208658) at log.c:255 #4 0x003fcf003a9c in cancel_recurring_action (op=0xb9fae0) at services.c:356 #5 0x003fcf003bc6 in services_action_cancel (name=0xb9f350 prmPing, action=0xb9ee90 monitor, interval=1) at services.c:381 #6 0x00406595 in cancel_op (rsc_id=0xb9f350 prmPing, action=0xb9ee90 monitor, interval=1) at lrmd.c:1197 #7 0x004067aa in process_lrmd_rsc_cancel (client=0xb926c0, id=7030, request=0xb95ad0) at lrmd.c:1261 #8 0x00406a51 in process_lrmd_message (client=0xb926c0, id=7030, request=0xb95ad0) at lrmd.c:1300 #9 0x00402a06 in lrmd_ipc_dispatch (c=0xb91af0, data=0x7f9f30acbc08, size=362) at main.c:141 #10 0x003fcb8126f8 in _process_request_ (c=0xb91af0, ms_timeout=10) at ipcs.c:698 #11 0x003fcb812ad5 in qb_ipcs_dispatch_connection_request (fd=5, revents=1, data=0xb91af0) at ipcs.c:801 #12 0x003fcc0327b1 in gio_read_socket (gio=0xb92880, condition=G_IO_IN, data=0xb91138) at mainloop.c:437 #13 0x003fc9c3feb2 in g_main_context_dispatch () from
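If it helps, one way to get lrmd under valgrind on a test node is through /etc/sysconfig/pacemaker; assuming the PCMK_valgrind_enabled and VALGRIND_OPTS hooks are present in this build (they may not be in every packaging), something like the following, followed by a pacemaker restart, should produce per-process valgrind logs:

  # /etc/sysconfig/pacemaker (test clusters only -- valgrind slows lrmd down considerably)
  PCMK_valgrind_enabled=lrmd
  VALGRIND_OPTS="--leak-check=full --num-callers=25 --log-file=/var/lib/pacemaker/valgrind-%p"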
Re: [Pacemaker] Time to get ready for 1.1.11
- Original Message - From: Andrey Groshev gre...@yandex.ru To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Thursday, December 12, 2013 4:14:20 AM Subject: Re: [Pacemaker] Time to get ready for 1.1.11 And why not include it? https://github.com/beekhof/pacemaker/commit/a4bdc9a That commit is in the release candidate and will be included in the final 1.1.11 release :) https://github.com/ClusterLabs/pacemaker/commit/a4bdc9a ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] 1.1.10 problems on CentOS 6.5
- Original Message - From: Diego Remolina diego.remol...@physics.gatech.edu To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Thursday, December 12, 2013 7:39:37 AM Subject: [Pacemaker] 1.1.10 problems on CentOS 6.5 I was successfully running 1.1.8 on a pair of CentOS 6.4 servers and after updating to CentOS 6.5 and 1.1.10, pacemaker miss-behaves. The first symptoms appeared with the 1.1.10-14.el6 packages. About 20 hours after the upgrade, the first drbd_monitor issues came out. Dec 09 18:50:12 Updated: pacemaker-libs-1.1.10-14.el6.x86_64 Dec 09 18:50:13 Updated: pacemaker-cli-1.1.10-14.el6.x86_64 Dec 09 18:50:13 Updated: pacemaker-cluster-libs-1.1.10-14.el6.x86_64 Dec 09 18:50:13 Updated: pacemaker-1.1.10-14.el6.x86_64 Dec 10 15:27:55 ysmha01 lrmd[3076]: warning: child_timeout_callback: drbd_export_monitor_29000 process (PID 19608) timed out Dec 10 15:27:55 ysmha01 lrmd[3076]: warning: operation_finished: drbd_export_monitor_29000:19608 - timed out after 2ms Dec 10 15:27:55 ysmha01 crmd[3079]:error: process_lrm_event: LRM operation drbd_export_monitor_29000 (77) Timed Out (timeout=2ms) Dec 10 15:27:56 ysmha01 crmd[3079]: notice: process_lrm_event: LRM operation drbd_export_notify_0 (call=99, rc=0, cib-update=0, confirmed=true) ok These errors look like a real resource failure. Pacemaker appears to be doing its job here. In this case the drbd script is being called, but never exiting (which results in the timeout). Your update of pacemaker likely has nothing to do with this. An update of anything DRBD related would make more sense. At this point, I tried taking the node to standby and back to online and cleaning the resources to no avail. I tried stopping pacemaker without luck. I rebooted both servers and on Dec 11, the failure started with failure to monitor pingd, then drbd_monitor. Dec 11 16:12:10 ysmha01 lrmd[3060]: warning: child_timeout_callback: pingd_monitor_2 process (PID 26237) timed out Dec 11 16:12:10 ysmha01 lrmd[3060]: warning: operation_finished: pingd_monitor_2:26237 - timed out after 15000ms Dec 11 16:12:10 ysmha01 crmd[3063]:error: process_lrm_event: LRM operation pingd_monitor_2 (35) Timed Out (timeout=15000ms) Dec 11 16:12:19 ysmha01 lrmd[3060]: warning: child_timeout_callback: drbd_export_monitor_29000 process (PID 26268) timed out Dec 11 16:12:19 ysmha01 lrmd[3060]: warning: operation_finished: drbd_export_monitor_29000:26268 - timed out after 2ms Dec 11 16:12:19 ysmha01 crmd[3063]:error: process_lrm_event: LRM operation drbd_export_monitor_29000 (62) Timed Out (timeout=2ms) I upgraded to the latest rpms yesterday afternoon (1.1.10-14.el6_5.1). Right before 1 am, there were issues again. Dec 12 00:49:39 ysmha01 pengine[3149]: notice: process_pe_message: Calculated Transition 41: /var/lib/pacemaker/pengine/pe-input-173.bz2 Dec 12 00:50:03 ysmha01 lrmd[3147]: warning: child_timeout_callback: drbd_export_monitor_29000 process (PID 18496) timed out Dec 12 00:50:03 ysmha01 lrmd[3147]: warning: operation_finished: drbd_export_monitor_29000:18496 - timed out after 2ms Dec 12 00:50:03 ysmha01 crmd[3150]:error: process_lrm_event: LRM operation drbd_export_monitor_29000 (60) Timed Out (timeout=2ms) I am for now manually running the machines without pacemaker. What suggestions do you have for me? What should I try first? Manually running the commands works? Something weird is going on. - Revert to 1.1.8? - Could be something related to drbd in the new kernel? Downgrade kernel rpm? I can post logs on request, where would be a good place to do that? 
Make a crm_report and attach the crm_report file here. Thanks, Diego ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
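A hedged example of the crm_report invocation being asked for, using a window around the Dec 12 timeouts from the logs above (adjust the times and the destination name to taste):

  # crm_report -f "2013-12-12 00:30:00" -t "2013-12-12 01:30:00" /tmp/drbd-monitor-timeouts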
Re: [Pacemaker] Time to get ready for 1.1.11
- Original Message - From: Andrew Beekhof and...@beekhof.net To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Wednesday, November 20, 2013 9:02:40 PM Subject: [Pacemaker] Time to get ready for 1.1.11 With over 400 updates since the release of 1.1.10, its time to start thinking about a new release. Today I have tagged release candidate 1[1]. The most notable fixes include: + attrd: Implementation of a truely atomic attrd for use with corosync 2.x + cib: Allow values to be added/updated and removed in a single update + cib: Support XML comments in diffs + Core: Allow blackbox logging to be disabled with SIGUSR2 + crmd: Do not block on proxied calls from pacemaker_remoted + crmd: Enable cluster-wide throttling when the cib heavily exceeds its target load + crmd: Use the load on our peers to know how many jobs to send them + crm_mon: add --hide-headers option to hide all headers + crm_report: Collect logs directly from journald if available + Fencing: On timeout, clean up the agent's entire process group + Fencing: Support agents that need the host to be unfenced at startup + ipc: Raise the default buffer size to 128k + PE: Add a special attribute for distinguishing between real nodes and containers in constraint rules + PE: Allow location constraints to take a regex pattern to match against resource IDs + pengine: Distinguish between the agent being missing and something the agent needs being missing + remote: Properly version the remote connection protocol + services: Detect missing agents and permission errors before forking + Bug cl#5171 - pengine: Don't prevent clones from running due to dependant resources + Bug cl#5179 - Corosync: Attempt to retrieve a peer's node name if it is not already known + Bug cl#5181 - corosync: Ensure node IDs are written to the CIB as unsigned integers If you are a user of `pacemaker_remoted`, you should take the time to read about changes to the online wire protocol[2] that are present in this release. [1] https://github.com/ClusterLabs/pacemaker/releases/Pacemaker-1.1.11-rc1 [2] http://blog.clusterlabs.org/blog/2013/changes-to-the-remote-wire-protocol/ To build `rpm` packages for testing: 1. Clone the current sources: # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git # cd pacemaker 1. If you haven't already, install Pacemaker's dependancies [Fedora] # sudo yum install -y yum-utils [ALL] # make rpm-dep 1. Build Pacemaker # make rc 1. Copy the rpms and deploy as needed A new release candidate, Pacemaker-1.1.11-rc2, is ready for testing. https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.11-rc2 Assuming no major regressions are encountered during testing, this tag will become the final Pacemaker-1.1.11 release a week from today. -- Vossel ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] catch-22: can't fence node A because node A has the fencing resource
- Original Message - From: Brian J. Murrell br...@interlinx.bc.ca To: pacema...@clusterlabs.org Sent: Monday, December 2, 2013 2:50:41 PM Subject: [Pacemaker] catch-22: can't fence node A because node A has the fencing resource So, I'm migrating my working pacemaker configuration from 1.1.7 to 1.1.10 and am finding what appears to be a new behavior in 1.1.10. If a given node is running a fencing resource and that node goes AWOL, it needs to be fenced (of course). But any other node trying to take over the fencing resource to fence it appears to first want the current owner of the fencing resource to fence the node. Of course that can't happen since the node that needs to do the fencing is AWOL. So while I can buy into the general policy that a node needs to be fenced in order to take over it's resources, fencing resources have to be excepted from this or there can be this catch-22. We did away with all of the policy engine logic involved with trying to move fencing devices off of the target node before executing the fencing action. Behind the scenes all fencing devices are now essentially clones. If the target node to be fenced has a fencing device running on it, that device can execute anywhere in the cluster to avoid the suicide situation. When you are looking at crm_mon output and see a fencing device is running on a specific node, all that really means is that we are going to attempt to execute fencing actions for that device from that node first. If that node is unavailable, we'll try that same device anywhere in the cluster we can get it to work (unless you've specifically built some location constraint that prevents the fencing device from ever running on a specific node) Hope that helps. -- Vossel I believe that is how things were working in 1.1.7 but now that I'm on 1.1.10[-1.el6_4.4] this no longer seems to be the case. Or perhaps there is some additional configuration that 1.1.10 needs to effect this behavior. Here is my configuration: Cluster Name: Corosync Nodes: Pacemaker Nodes: host1 host2 Resources: Resource: rsc1 (class=ocf provider=foo type=Target) Attributes: target=111bad0a-a86a-40e3-b056-c5c93168aa0d Meta Attrs: target-role=Started Operations: monitor interval=5 timeout=60 (rsc1-monitor-5) start interval=0 timeout=300 (rsc1-start-0) stop interval=0 timeout=300 (rsc1-stop-0) Resource: rsc2 (class=ocf provider=chroma type=Target) Attributes: target=a8efa349-4c73-4efc-90d3-d6be7d73c515 Meta Attrs: target-role=Started Operations: monitor interval=5 timeout=60 (rsc2-monitor-5) start interval=0 timeout=300 (rsc2-start-0) stop interval=0 timeout=300 (rsc2-stop-0) Stonith Devices: Resource: st-fencing (class=stonith type=fence_foo) Fencing Levels: Location Constraints: Resource: rsc1 Enabled on: host1 (score:20) (id:rsc1-primary) Enabled on: host2 (score:10) (id:rsc1-secondary) Resource: rsc2 Enabled on: host2 (score:20) (id:rsc2-primary) Enabled on: host1 (score:10) (id:rsc2-secondary) Ordering Constraints: Colocation Constraints: Cluster Properties: cluster-infrastructure: classic openais (with plugin) dc-version: 1.1.10-1.el6_4.4-368c726 expected-quorum-votes: 2 no-quorum-policy: ignore stonith-enabled: true symmetric-cluster: true One thing that PCS is not showing that might be relevant here is that I have a a resource stickiness value set to 1000 to prevent resources from failing back to nodes after a failover. 
With the above configuration if host1 is shut down, host2 just spins in a loop doing: Dec 2 20:00:02 host2 pengine[8923]: warning: pe_fence_node: Node host1 will be fenced because the node is no longer part of the cluster Dec 2 20:00:02 host2 pengine[8923]: warning: determine_online_status: Node host1 is unclean Dec 2 20:00:02 host2 pengine[8923]: warning: custom_action: Action st-fencing_stop_0 on host1 is unrunnable (offline) Dec 2 20:00:02 host2 pengine[8923]: warning: custom_action: Action rsc1_stop_0 on host1 is unrunnable (offline) Dec 2 20:00:02 host2 pengine[8923]: warning: stage6: Scheduling Node host1 for STONITH Dec 2 20:00:02 host2 pengine[8923]: notice: LogActions: Move st-fencing#011(Started host1 - host2) Dec 2 20:00:02 host2 pengine[8923]: notice: LogActions: Move rsc1#011(Started host1 - host2) Dec 2 20:00:02 host2 crmd[8924]: notice: te_fence_node: Executing reboot fencing operation (13) on host1 (timeout=6) Dec 2 20:00:02 host2 stonith-ng[8920]: notice: handle_request: Client crmd.8924.39504cd3 wants to fence (reboot) 'host1' with device '(any)' Dec 2 20:00:02 host2 stonith-ng[8920]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for host1: ad69ead5-0bbb-45d8-bb07-30bcd405ace2 (0) Dec 2 20:00:02 host2 pengine[8923]: warning: process_pe_message: Calculated Transition 22:
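One way to observe the behaviour described above (fencing devices effectively acting like clones) is to ask stonith-ng, from a surviving node, which devices it is willing to use against a given target; assuming stonith_admin is available, something like:

  # stonith_admin --list host1     # devices capable of fencing host1, regardless of where crm_mon says they "run"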
Re: [Pacemaker] Fencing: Where?
- Original Message - From: Michael Schwartzkopff m...@sys4.de To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Monday, December 2, 2013 4:38:12 AM Subject: Re: [Pacemaker] Fencing: Where? Am Montag, 2. Dezember 2013, 13:15:23 schrieb Nikita Staroverov: Hi, as far as I unterstood RH is going to do all infrastructure in the cman layer and user pacemaker only for resource management. Whith this setup fencing also will be a job of the fenced of the cman package. This design has its advantages: If descisions are taked on a low level, all parts of the cluster have the chance to know about theses descisions. BUT: If pacemaker needs to fence a node, it has no possibility to so so any more. Imagine a resource will not stop and pacemaker would like to fence that node to go on. How would that situation be handled with fencing in cman? Is there any way pacemaker can tell cman about it's wish to fence an other nore? The only solution I can think of is to delegate the fencing in cman to pacemaker. So both layers are able to fence. But this cannot be a good solution, since we have to setup fencing on two places. Thanks for fruitful discussion. Mit freundlichen Grüßen, Michael Schwartzkopff fence_pcmk does the job. Have you got any troubles with it? No. I use it now. But setting up fencing in two places - cman - links to pacemaker - pamcemaker it not very nice to configure fencing in two places. You aren't really configuring fencing devices in two places. The fence_pcmk device in cluster.conf is just telling 'fenced' to forward all its fencing requests to stonith-ng. This allows us to configure all the real fencing devices in one place now using pacemaker. -- Vossel Is there any way that pacemaker can tell cman about it's fencing descisions -- Mit freundlichen Grüßen, Michael Schwartzkopff -- [*] sys4 AG http://sys4.de, +49 (89) 30 90 46 64, +49 (162) 165 0044 Franziskanerstraße 15, 81669 München Sitz der Gesellschaft: München, Amtsgericht München: HRB 199263 Vorstand: Patrick Ben Koetter, Axel von der Ohe, Marc Schiffbauer Aufsichtsratsvorsitzender: Florian Kirstein ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
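For completeness, the fence_pcmk redirection described above is the only fencing-related configuration cluster.conf needs; per node it comes down to something like the following ccs calls (node1 is a placeholder), with the real fencing devices then defined once on the pacemaker side:

  # ccs -f /etc/cluster/cluster.conf --addfencedev pcmk agent=fence_pcmk
  # ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect node1
  # ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk node1 pcmk-redirect port=node1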
Re: [Pacemaker] Can in pass commands to Pacemaker from a Remote Machine
- Original Message - From: Puneet Jindal puneetjin...@drishti-soft.com To: pacemaker@oss.clusterlabs.org Sent: Sunday, November 24, 2013 2:58:35 AM Subject: [Pacemaker] Can in pass commands to Pacemaker from a Remote Machine Hello, I want to build a GUI on top of Pacemaker. I configured remote-tls-port in the CIB, and the CIB is now listening on that port. What commands can I send to the CIB, and how do I send them? Can anyone provide some examples? Look at the 'cibadmin' tool. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
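As a starting point for the cibadmin suggestion: the command-line tools can talk to a remote CIB over the remote-tls-port using a handful of CIB_* environment variables. The host name, port and credentials below are placeholders, and the exact variable set depends on the pacemaker version, so verify against your build:

  # export CIB_server=cluster-node.example.com
  # export CIB_port=1234
  # export CIB_encrypted=true
  # export CIB_user=hacluster
  # export CIB_passwd=secret
  # cibadmin --query                 # dump the whole CIB as XML
  # cibadmin --query -o resources    # or just one section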
Re: [Pacemaker] Use of Pacemaker to configure a new resource
- Original Message - From: Aarti Sawant aartipsawan...@gmail.com To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Wednesday, November 20, 2013 3:55:40 AM Subject: [Pacemaker] Use of Pacemaker to configure a new resource Hello, I have a PostgreSQL replication set up. It consists of master A and standby B C which are directly connecting to master. I am using a tool called WANPRoxy for compression of data to be transferred between master and standby. In case of failure of master, I want WANProxy to run on any of the standbys. As far as I understand, this can be done by scripting OCF Resource Agent for WANProxy? Yes, you could make your own custom OCF script to manage WANProxy, or you could potentially have pacemaker start WANProxy using whatever init script is shipped with the daemon as well. So my primary question is that can Pacemaker be used to start WANProxy compression on different machine when one of my machine fails? yes, This sounds like basic resource failover, which is the primary reason people use Pacemaker. The command to start WANProxy is wanproxy -c /home/user_name/server_v2.conf Also , I want to know if by using some parameters in Resource Agents, can pacemaker also modify configuration files of compression tools like WANProxy, ssl ,ssh? Pacemaker simply executes resource-agent scripts and is not concerned with what those scripts do. If you want to modify configuration files for some reason, build that logic into your custom OCF scripts and pacemaker will execute. -- Vossel Thanks, Aarti Sawant NTTDATA OSS Center Pune ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
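Since the question is partly about what a custom OCF script for WANProxy would look like, below is a minimal, hypothetical skeleton (shell), not a shipped agent. The config path comes from the post; the pidfile handling, the 'config' resource parameter and the backgrounding of wanproxy are assumptions. A production agent would also want real meta-data, validate-all logic and more robust process tracking:

  #!/bin/sh
  # Hypothetical OCF agent sketch for wanproxy -- not a shipped agent.
  : ${OCF_FUNCTIONS_DIR:=${OCF_ROOT:-/usr/lib/ocf}/lib/heartbeat}
  . ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

  CONF="${OCF_RESKEY_config:-/home/user_name/server_v2.conf}"    # 'config' parameter name is an assumption
  PIDFILE="${HA_RSCTMP}/wanproxy-${OCF_RESOURCE_INSTANCE}.pid"

  wanproxy_monitor() {
      [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null && return $OCF_SUCCESS
      return $OCF_NOT_RUNNING
  }
  wanproxy_start() {
      wanproxy_monitor && return $OCF_SUCCESS
      wanproxy -c "$CONF" &                # assumes wanproxy runs in the foreground
      echo $! > "$PIDFILE"
      wanproxy_monitor
  }
  wanproxy_stop() {
      wanproxy_monitor || return $OCF_SUCCESS
      kill "$(cat "$PIDFILE")" && rm -f "$PIDFILE"
      return $OCF_SUCCESS
  }

  case "$1" in
      start)        wanproxy_start ;;
      stop)         wanproxy_stop ;;
      monitor)      wanproxy_monitor ;;
      meta-data)    echo '<?xml version="1.0"?><resource-agent name="wanproxy"/>' ;;   # placeholder only
      validate-all) exit $OCF_SUCCESS ;;
      *)            exit $OCF_ERR_UNIMPLEMENTED ;;
  esac

If something like this were dropped into /usr/lib/ocf/resource.d/&lt;yourprovider&gt;/wanproxy and made executable, it could then be created as an ordinary primitive and failed over between nodes by pacemaker like any other resource.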
Re: [Pacemaker] Cannot use ocf::heartbeat:IPsrcaddr (RTNETLINK answers: No such process)
- Original Message - From: Mathieu Peltier mathieu.pelt...@gmail.com To: pacemaker@oss.clusterlabs.org Sent: Wednesday, November 6, 2013 11:27:50 AM Subject: [Pacemaker] Cannot use ocf::heartbeat:IPsrcaddr (RTNETLINK answers: No such process) Hi, I am trying to set up a simple cluster of 2 machines on CentOS 6.4: pacemaker-cli-1.1.10-1.el6_4.4.x86_64 pacemaker-1.1.10-1.el6_4.4.x86_64 pacemaker-libs-1.1.10-1.el6_4.4.x86_64 pacemaker-cluster-libs-1.1.10-1.el6_4.4.x86_64 corosync-1.4.1-15.el6_4.1.x86_64 corosynclib-1.4.1-15.el6_4.1.x86_64 pcs-0.9.90-1.el6_4.noarch cman-3.0.12.1-49.el6_4.2.x86_64 resource-agents-3.9.2-21.el6_4.8.x86_64 I am using the following script to configure the cluster: -- #!/bin/bash CLUSTER_NAME=test CONFIG_FILE=/etc/cluster/cluster.conf NODE1_EM1=node1 NODE2_EM1=node2 NODE1_EM2=node1-priv NODE2_EM2=node2-priv VIP=192.168.0.6 MONITOR_INTERVAL=60s # Make sure that pacemaker is stopped on both nodes # NOT INCLUDED HERE # Delete existing configuration rm -rf /var/log/cluster/* ssh root@$NODE2_EM2 'rm -rf /var/log/cluster/*' rm -rf /var/lib/pacemaker/cib/* /var/lib/pacemaker/cores/* /var/lib/pacemaker/pengine/* /var/lib/corosync/* /var/lib/cluster/* ssh root@$NODE2_EM2 'rm -rf /var/lib/pacemaker/cib/* /var/lib/pacemaker/cores/* /var/lib/pacemaker/pengine/* /var/lib/corosync/* /var/lib/cluster/*' # Create the cluster ccs -f $CONFIG_FILE --createcluster $CLUSTER_NAME # Add nodes to the cluster ccs -f $CONFIG_FILE --addnode $NODE1_EM1 ccs -f $CONFIG_FILE --addnode $NODE2_EM1 ccs -f $CONFIG_FILE --setcman two_node=1 expected_votes=1 # Add alternative nodes name so that both network interfaces are used ccs -f $CONFIG_FILE --addalt $NODE1_EM1 $NODE1_EM2 ccs -f $CONFIG_FILE --addalt $NODE2_EM1 $NODE2_EM2 ccs -f $CONFIG_FILE --setdlm protocol=sctp # Teach CMAN how to send it's fencing requests to Pacemaker ccs -f $CONFIG_FILE --addfencedev pcmk agent=fence_pcmk ccs -f $CONFIG_FILE --addmethod pcmk-redirect $NODE1_EM1 ccs -f $CONFIG_FILE --addmethod pcmk-redirect $NODE2_EM1 ccs -f $CONFIG_FILE --addfenceinst pcmk $NODE1_EM1 pcmk-redirect port=$NODE1_EM1 ccs -f $CONFIG_FILE --addfenceinst pcmk $NODE2_EM1 pcmk-redirect port=$NODE2_EM1 # Deploy configuration to node2 scp /etc/cluster/cluster.conf root@$NODE2_EM2:/etc/cluster/cluster.conf # Start pacemaker on main node /etc/init.d/pacemaker start sleep 30 # Disable stonith pcs property set stonith-enabled=false # Disable quorum pcs property set no-quorum-policy=ignore # Define ressources pcs resource create VIP_EM1 ocf:heartbeat:IPaddr params nic=em1 ip=$VIP_EM1 cidr_netmask=24 op monitor interval=$MONITOR_INTERVAL pcs resource create PREFERRED_SRC_IP ocf:heartbeat:IPsrcaddr params ipaddress=$VIP_EM1 op monitor interval=$MONITOR_INTERVAL # Define initial location and prevent ressources to go back to initial server after a failure pcs resource defaults resource-stickiness=100 pcs constraint location VIP_EM1 prefers $NODE1_EM1=50 -- After running this script from node1: root@node1# pcs status Cluster name: Last updated: Wed Nov 6 17:17:30 2013 Last change: Wed Nov 6 17:06:20 2013 via crm_attribute on node1 Stack: cman Current DC: node1 - partition with quorum Version: 1.1.10-1.el6_4.4-368c726 2 Nodes configured 2 Resources configured Online: [ node1 ] OFFLINE: [ node2 ] Full list of resources: VIP_EM1(ocf::heartbeat:IPaddr):Stopped PREFERRED_SRC_IP(ocf::heartbeat:IPsrcaddr):Stopped Failed actions: PREFERRED_SRC_IP_start_0 on node1 'unknown error' (1): call=19, status=complete, last-rc-change='Wed Nov 6 17:06:20 2013', queued=67ms, 
exec=0ms PCSD Status: Error: no nodes found in corosync.conf root@node1# ip route show 192.168.8.0/24 dev em2 proto kernel scope link src 192.168.8.1 default via 192.168.0.1 dev em1 Error in /var/log/cluster/corosync.log: ... IPsrcaddr(PREFERRED_SRC_IP)[638]: 2013/11/06_16:50:32 ERROR: command 'ip route change to default via 192.168.0.1 dev em1 src 192.168.0.6' failed Nov 06 16:50:32 [32461] node1.domain.org lrmd: notice: operation_finished: PREFERRED_SRC_IP_start_0:638:stderr [ RTNETLINK answers: No such process ] ... If I run the command manually when pacemaker is not started (after rebooting the machine for example), the default route is modified as expected (I use 192.168.0.106 because the alias 192.168.0.6 is not started) # ip route show 192.168.0.0/24 dev em1 proto kernel scope link src 192.168.0.106 192.168.8.0/24 dev em3 proto kernel scope link src 192.168.8.1 default via 192.168.0.1 dev em1 # ip route change to default via 192.168.0.1 dev em1 src 192.168.0.106 # ip route show 192.168.0.0/24 dev em1 proto kernel scope link src 192.168.0.106 192.168.8.0/24 dev em3 proto kernel
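Not necessarily the root cause, but one thing worth checking from the output above: PREFERRED_SRC_IP is being started while the VIP it is supposed to use as the source address is not yet on the interface, and IPsrcaddr cannot rewrite the default route to a source address that does not exist locally. If that is what is happening, the usual arrangement is to colocate and order it after the IPaddr resource, for example (a sketch only, using the names from the script above):

  # pcs constraint colocation add PREFERRED_SRC_IP with VIP_EM1 INFINITY
  # pcs constraint order VIP_EM1 then PREFERRED_SRC_IP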
Re: [Pacemaker] Question about new migration logic
- Original Message - From: Kazunori INOUE kazunori.ino...@gmail.com To: pm pacemaker@oss.clusterlabs.org Sent: Friday, November 1, 2013 4:43:53 AM Subject: [Pacemaker] Question about new migration logic Hi David, Because I have a plan to test the function of migration in pacemaker-1.1, I am interested in this commit. https://github.com/davidvossel/pacemaker/commit/673e8599e4 If this new migration logic is merged into ClusterLabs, when do you think it will be merged? That patch is still incomplete. I'm pretty sure the idea I'm working with there will work, but it's not quite there yet. As far as when it will make it into Clusterlabs, that's hard to say. It has been a challenge to free up the day or two I need to finish it up. Hopefully I'll be able to work on it soon. -- Vossel Best Regards, Kazunori INOUE ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Asymmetric cluster, clones, and location constraints
- Original Message - From: Andrew Beekhof and...@beekhof.net To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Wednesday, October 30, 2013 1:08:12 AM Subject: Re: [Pacemaker] Asymmetric cluster, clones, and location constraints On 25 Oct 2013, at 9:40 am, David Vossel dvos...@redhat.com wrote: - Original Message - From: Lindsay Todd rltodd@gmail.com To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Wednesday, October 23, 2013 2:38:17 PM Subject: Re: [Pacemaker] Asymmetric cluster, clones, and location constraints David, The Infiniband network takes a nondeterministic amount of time to actually finish initializing, so we use ethmonitor to watch it; the OS is supposed to bring it up at boot time, but it moves on through the boot sequence without actually waiting for it. So in self defense we watch it with pacemaker. I guess I could restructure this to use a resource that brings up IB (with a really long time out) and use ordering to wait for that complete, but it seems that ethmonitor would be more adaptive to short-term IB network issues. Since ethmonitor works by setting an attribute (the RA running means it is watching the network, not that the network is up), I've used location constraints instead of ordering constraints. So I have completely restarted my cluster. Right now the physical nodes see each other, and the fencing agents are running. The first thing that should start are the ethmonitor resource agents on the VM hosts (the c-watch-ib0 clones of the p-watch-ib0 primitive). They are not starting (like they used to). I see. Your cib generates an invalid transition. I'll try and look into it in more detail soon to understand the cause. According to git bisect, the winner is: I always knew I was a winner 15a86e501a57b50fdb3b8ce0ed432b183c343c74 is the first bad commit commit 15a86e501a57b50fdb3b8ce0ed432b183c343c74 Author: David Vossel dvos...@redhat.com Date: Mon Sep 23 18:55:21 2013 -0500 High: pengine: Probe container nodes I'll take a look in the morning unless David beats me to it :-) This is a tough one. I enabled probing container nodes, but didn't anticipate the scenario where there's an ordering constraint involving a container node's container resource (the VM). I have an idea of how to fix this, but the end result might make probing containers is useless. I'll give this some thought. Until then, there is a real easy workaround for this. Set the 'enable-container-probes' global config option to false -- Vossel One completely unrelated thought I had while looking at your config involves your fencing agents. You shouldn't have to use location constraints at on the fencing agents. I believe stonith is smart enough now to execute the agent on a node that isn't the target regardless of where the policy engine puts it. -- Vossel The cib snapshot can be seen in http://pastebin.com/TccTHQPS (some slight editing to hide passwords in fencing agents). /Lindsay On Wed, Oct 23, 2013 at 11:20 AM, David Vossel dvos...@redhat.com wrote: - Original Message - From: Lindsay Todd rltodd@gmail.com To: The Pacemaker cluster resource manager Pacemaker@oss.clusterlabs.org Sent: Tuesday, October 22, 2013 4:19:11 PM Subject: [Pacemaker] Asymmetric cluster, clones, and location constraints I am getting rather unexpected behavior when I combine clones, location constraints, and remote nodes in an asymmetric cluster. My cluster is configured to be asymmetric, distinguishing between vmhosts and various sorts of remote nodes. 
Currently I am running upstream version b6d42ed. I am simplifying my description to avoid confusion, hoping in so doing I don't miss any salient points... My physical cluster nodes, also the VM hosts, have the attribute nodetype=vmhost. They also have Infiniband interfaces, which take some time to come up. I don't want my shared file system (which needs IB), or libvirtd (which needs the file system), to come up before IB... So I have this in my configuration: primitive p-watch-ib0 ocf:heartbeat:ethmonitor \ params \ interface=ib0 \ op monitor timeout=100s interval=10s clone c-watch-ib0 p-watch-ib0 \ meta interleave=true # location loc-watch-ib-only-vmhosts c-watch-ib0 \ rule 0: nodetype eq vmhost Something broke between upstream versions 0a2570a and c68919f -- the c-watch-ib0 clone never starts. I've found that if I run crm_resource --force-start -r p-watch-ib0 when IB is running, the ethmonitor-ib0 attribute is not set like it used to be. Oh well, I can set it manually. So let's. A re-write of the attrd component was introduced around that time period. This should have been resolved at this point in the b6d42ed build
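For anyone hitting the same transition problem, the workaround mentioned above would be set as an ordinary cluster property; assuming the option name is exactly as given ('enable-container-probes'), either of the following should do it (the crm shell form matches the tools used elsewhere in this thread):

  # crm configure property enable-container-probes=false
  # crm_attribute --type crm_config --name enable-container-probes --update false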
Re: [Pacemaker] Asymmetric cluster, clones, and location constraints
- Original Message - From: Lindsay Todd rltodd@gmail.com To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Wednesday, October 23, 2013 2:38:17 PM Subject: Re: [Pacemaker] Asymmetric cluster, clones, and location constraints David, The Infiniband network takes a nondeterministic amount of time to actually finish initializing, so we use ethmonitor to watch it; the OS is supposed to bring it up at boot time, but it moves on through the boot sequence without actually waiting for it. So in self defense we watch it with pacemaker. I guess I could restructure this to use a resource that brings up IB (with a really long time out) and use ordering to wait for that complete, but it seems that ethmonitor would be more adaptive to short-term IB network issues. Since ethmonitor works by setting an attribute (the RA running means it is watching the network, not that the network is up), I've used location constraints instead of ordering constraints. So I have completely restarted my cluster. Right now the physical nodes see each other, and the fencing agents are running. The first thing that should start are the ethmonitor resource agents on the VM hosts (the c-watch-ib0 clones of the p-watch-ib0 primitive). They are not starting (like they used to). I see. Your cib generates an invalid transition. I'll try and look into it in more detail soon to understand the cause. One completely unrelated thought I had while looking at your config involves your fencing agents. You shouldn't have to use location constraints at on the fencing agents. I believe stonith is smart enough now to execute the agent on a node that isn't the target regardless of where the policy engine puts it. -- Vossel The cib snapshot can be seen in http://pastebin.com/TccTHQPS (some slight editing to hide passwords in fencing agents). /Lindsay On Wed, Oct 23, 2013 at 11:20 AM, David Vossel dvos...@redhat.com wrote: - Original Message - From: Lindsay Todd rltodd@gmail.com To: The Pacemaker cluster resource manager Pacemaker@oss.clusterlabs.org Sent: Tuesday, October 22, 2013 4:19:11 PM Subject: [Pacemaker] Asymmetric cluster, clones, and location constraints I am getting rather unexpected behavior when I combine clones, location constraints, and remote nodes in an asymmetric cluster. My cluster is configured to be asymmetric, distinguishing between vmhosts and various sorts of remote nodes. Currently I am running upstream version b6d42ed. I am simplifying my description to avoid confusion, hoping in so doing I don't miss any salient points... My physical cluster nodes, also the VM hosts, have the attribute nodetype=vmhost. They also have Infiniband interfaces, which take some time to come up. I don't want my shared file system (which needs IB), or libvirtd (which needs the file system), to come up before IB... So I have this in my configuration: primitive p-watch-ib0 ocf:heartbeat:ethmonitor \ params \ interface=ib0 \ op monitor timeout=100s interval=10s clone c-watch-ib0 p-watch-ib0 \ meta interleave=true # location loc-watch-ib-only-vmhosts c-watch-ib0 \ rule 0: nodetype eq vmhost Something broke between upstream versions 0a2570a and c68919f -- the c-watch-ib0 clone never starts. I've found that if I run crm_resource --force-start -r p-watch-ib0 when IB is running, the ethmonitor-ib0 attribute is not set like it used to be. Oh well, I can set it manually. So let's. A re-write of the attrd component was introduced around that time period. This should have been resolved at this point in the b6d42ed build. 
We use GPFS for a shared file system, so I have an agent to start it and wait for a file system to mount. It should only run on VM hosts, and only when IB is running. So I have this: So the IB resource is setting some attribute that enables the fs to run? Why can't a ordering constraint be used here between IB and FS? primitive p-fs-gpfs ocf:ccni:gpfs \ params \ fspath=/gpfs/lb/utility \ op monitor timeout=20s interval=30s \ op start timeout=180s \ op stop timeout=120s clone c-fs-gpfs p-fs-gpfs \ meta interleave=true location loc-fs-gpfs-needs-ib0 c-fs-gpfs \ rule -inf: not_defined ethmonitor-ib0 or ethmonitor-ib0 eq 0 location loc-fs-gpfs-on-vmhosts c-fs-gpfs \ rule 0: nodetype eq vmhost That all used to start nicely. Now even if I set the ethmonitor-ib0 attribute, it doesn't. However, I can use crm_resource --force-start -r p-fs-gpfs on each of my VM hosts, then issue crm resource cleanup c-fs-gpfs, and all is well. I can use crm status to see something like: Last updated: Tue Oct 22 16:35:43 2013 Last change: Tue Oct 22 15:50:52 2013 via crmd on cvmh01 Stack: cman Current DC: cvmh04 - partition with quorum Version: 1.1.10-19.el6.ccni-b6d42ed 8 Nodes configured 92
Re: [Pacemaker] libqb-0.16 instability with standby/unstandby ?
- Original Message - From: Andrew Beekhof and...@beekhof.net To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Wednesday, October 23, 2013 12:12:32 AM Subject: Re: [Pacemaker] libqb-0.16 instability with standby/unstandby ? On 23 Oct 2013, at 8:39 am, David Vossel dvos...@redhat.com wrote: - Original Message - From: Mike Pomraning m...@pilcrow.madison.wi.us To: pacemaker@oss.clusterlabs.org Sent: Tuesday, October 22, 2013 10:49:28 AM Subject: [Pacemaker] libqb-0.16 instability with standby/unstandby ? Regarding Justin Burnham's recent Pacemaker crash on node unstandby/standby[0] message, is anyone else seeing this behavior with libqb-0.16? I'm getting anecdotal reports of the same behavior from a team at work using RHEL-derived pcmk-1.1.8 and corosync-1.4.1 with libqb-0.16. Reverting to libqb-0.14 appears to have solved the issue. Sorry, I don't have enough to reproduce yet, but the similarities in symptoms are suggestive. FWIW, Justin also noted off list that his problems appear to have begun after updating to 0.16 a short time ago. I've tracked this down. Don't use libqb v0.16.0 with any pacemaker version less than 1.1.10. So this isn't just a question of older pacemaker versions needing a rebuild? A rebuild won't help There are multiple elements involved with this problem. Libqb had reference count leaks in 0.14.4, once those got resolved we discovered a race condition in pacemaker 1.1.8 that caused a double free... Ultimately the reference count leaks looked like they covered up the problem in pacemaker... Updating to libqb 0.16.0 when using pacemaker 1.1.8 exposes the race condition problem, which is what you all are seeing. -- Vossel -Mike [0] http://comments.gmane.org/gmane.linux.highavailability.pacemaker/19289 ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
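A quick way to check whether a given node is running the affected combination described above (a pacemaker older than 1.1.10 together with libqb 0.16.0) is simply to compare the installed package versions:

  # rpm -q pacemaker libqb

If that pairing shows up, either hold libqb at 0.14.x or move pacemaker up to 1.1.10 or later.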
Re: [Pacemaker] Asymmetric cluster, clones, and location constraints
- Original Message - From: Lindsay Todd rltodd@gmail.com To: The Pacemaker cluster resource manager Pacemaker@oss.clusterlabs.org Sent: Tuesday, October 22, 2013 4:19:11 PM Subject: [Pacemaker] Asymmetric cluster, clones, and location constraints I am getting rather unexpected behavior when I combine clones, location constraints, and remote nodes in an asymmetric cluster. My cluster is configured to be asymmetric, distinguishing between vmhosts and various sorts of remote nodes. Currently I am running upstream version b6d42ed. I am simplifying my description to avoid confusion, hoping in so doing I don't miss any salient points... My physical cluster nodes, also the VM hosts, have the attribute nodetype=vmhost. They also have Infiniband interfaces, which take some time to come up. I don't want my shared file system (which needs IB), or libvirtd (which needs the file system), to come up before IB... So I have this in my configuration: primitive p-watch-ib0 ocf:heartbeat:ethmonitor \ params \ interface=ib0 \ op monitor timeout=100s interval=10s clone c-watch-ib0 p-watch-ib0 \ meta interleave=true # location loc-watch-ib-only-vmhosts c-watch-ib0 \ rule 0: nodetype eq vmhost Something broke between upstream versions 0a2570a and c68919f -- the c-watch-ib0 clone never starts. I've found that if I run crm_resource --force-start -r p-watch-ib0 when IB is running, the ethmonitor-ib0 attribute is not set like it used to be. Oh well, I can set it manually. So let's. A re-write of the attrd component was introduced around that time period. This should have been resolved at this point in the b6d42ed build. We use GPFS for a shared file system, so I have an agent to start it and wait for a file system to mount. It should only run on VM hosts, and only when IB is running. So I have this: So the IB resource is setting some attribute that enables the fs to run? Why can't a ordering constraint be used here between IB and FS? primitive p-fs-gpfs ocf:ccni:gpfs \ params \ fspath=/gpfs/lb/utility \ op monitor timeout=20s interval=30s \ op start timeout=180s \ op stop timeout=120s clone c-fs-gpfs p-fs-gpfs \ meta interleave=true location loc-fs-gpfs-needs-ib0 c-fs-gpfs \ rule -inf: not_defined ethmonitor-ib0 or ethmonitor-ib0 eq 0 location loc-fs-gpfs-on-vmhosts c-fs-gpfs \ rule 0: nodetype eq vmhost That all used to start nicely. Now even if I set the ethmonitor-ib0 attribute, it doesn't. However, I can use crm_resource --force-start -r p-fs-gpfs on each of my VM hosts, then issue crm resource cleanup c-fs-gpfs, and all is well. I can use crm status to see something like: Last updated: Tue Oct 22 16:35:43 2013 Last change: Tue Oct 22 15:50:52 2013 via crmd on cvmh01 Stack: cman Current DC: cvmh04 - partition with quorum Version: 1.1.10-19.el6.ccni-b6d42ed 8 Nodes configured 92 Resources configured Online: [ cvmh01 cvmh02 cvmh03 cvmh04 ] fence-cvmh01 (stonith:fence_ipmilan): Started cvmh04 fence-cvmh02 (stonith:fence_ipmilan): Started cvmh01 fence-cvmh03 (stonith:fence_ipmilan): Started cvmh01 fence-cvmh04 (stonith:fence_ipmilan): Started cvmh01 Clone Set: c-fs-gpfs [p-fs-gpfs] Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ] which is what I would expect (other than I expect pacemaker to have started these for me, like it used to). Now I also have clone resources to NFS-mount another file system, and actually do a bind mount out of the GPFS file system, which behave like the GPFS resource -- they used to just work, now I need to use crm_resource --force-start and clean up. 
That finally lets me start libvirtd, using this configuration: primitive p-libvirtd lsb:libvirtd \ op monitor interval=30s clone c-p-libvirtd p-libvirtd \ meta interleave=true order o-libvirtd-after-storage inf: \ ( c-fs-libvirt-VM-xcm c-fs-bind-libvirt-VM-cvmh ) \ c-p-libvirtd location loc-libvirtd-on-vmhosts c-p-libvirtd \ rule 0: nodetype eq vmhost Of course that used to just work, but now, like the other clones, I need to force-start libvirtd on the VM hosts, and clean up. Once I do that, all my VM resources, which are not clones, just start up like they are supposed to! Several of these are configured as remote nodes, and they have services configured to run in them. But now other strange things happen: Last updated: Tue Oct 22 16:46:29 2013 Last change: Tue Oct 22 15:50:52 2013 via crmd on cvmh01 Stack: cman Current DC: cvmh04 - partition with quorum Version: 1.1.10-19.el6.ccni-b6d42ed 8 Nodes configured 92 Resources configured ContainerNode slurmdb02:vm-slurmdb02: UNCLEAN (offline) Online: [ cvmh01 cvmh02 cvmh03 cvmh04 ] Containers: [ db02:vm-db02 ldap01:vm-ldap01 ldap02:vm-ldap02 ] fence-cvmh01 (stonith:fence_ipmilan): Started cvmh04 fence-cvmh02 (stonith:fence_ipmilan): Started cvmh01 fence-cvmh03 (stonith:fence_ipmilan): Started cvmh01 fence-cvmh04 (stonith:fence_ipmilan): Started