Re: [Pacemaker] Cluster crash
On Tue, Mar 27, 2012 at 9:36 PM, Hugo Deprez hugo.dep...@gmail.com wrote: Hello, I do have no-quorum-policy=ignore What about fencing? Any idea how to reproduce this? Regards, On 23 February 2012 23:43, Andrew Beekhof and...@beekhof.net wrote: On Thu, Feb 23, 2012 at 9:17 PM, Hugo Deprez hugo.dep...@gmail.com wrote: I don't think so, as I have other similar clusters on the same network and didn't have any issues. The only thing I could detect was that the virtual machine was unresponsive. But I think the VM crash was not like a power shutdown, more like it became very slow and then totally crashed. Even if the drbd-nagios resource times out, it should fail over to the other node, no? Not if no-quorum-policy wasn't set to ignore, or if we couldn't confirm the service is safely stopped on the original location. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
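Since the thread turns on no-quorum-policy and fencing, a quick reference: both are ordinary cluster properties and can be set from the crm shell. The values shown are simply the ones under discussion in this thread, not a recommendation for every cluster:

crm configure property no-quorum-policy=ignore
crm configure property stonith-enabled=true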
Re: [Pacemaker] VirtualDomain Shutdown Timeout
On Sun, Mar 25, 2012 at 6:27 AM, Andrew Martin amar...@xes-inc.com wrote: Hello, I have configured a KVM virtual machine primitive using Pacemaker 1.1.6 and Heartbeat 3.0.5 on Ubuntu 10.04 Server using DRBD as the storage device (so there is no shared storage, no live-migration): primitive p_vm ocf:heartbeat:VirtualDomain \ params config=/vmstore/config/vm.xml \ meta allow-migrate=false \ op start interval=0 timeout=180s \ op stop interval=0 timeout=120s \ op monitor interval=10 timeout=30 I would expect the following events to happen on failover on the from node (the migration source) if the VM hangs while shutting down: 1. VirtualDomain issues virsh shutdown vm to gracefully shutdown the VM 2. pacemaker waits 120 seconds for the timeout specified in the op stop timeout 3. VirtualDomain waits a bit less than 120 seconds to see if it will gracefully shutdown. Once it gets to almost 120 seconds, it issues virsh destroy vm to hard stop the VM. 4. pacemaker wakes up from the 120 second timeout and sees that the VM has stopped and proceeds with the failover However, I observed that VirtualDomain seems to be using the timeout from the op start line, 180 seconds, yet pacemaker uses the 120 second timeout. Thus, the VM is still running after the pacemaker timeout is reached and so the node is STONITHed. Here is the relevant section of code from /usr/lib/ocf/resource.d/heartbeat/VirtualDomain: VirtualDomain_Stop() { local i local status local shutdown_timeout local out ex VirtualDomain_Status status=$? case $status in $OCF_SUCCESS) if ! ocf_is_true $OCF_RESKEY_force_stop; then # Issue a graceful shutdown request ocf_log info Issuing graceful shutdown request for domain ${DOMAIN_NAME}. virsh $VIRSH_OPTIONS shutdown ${DOMAIN_NAME} # The shutdown_timeout we use here is the operation # timeout specified in the CIB, minus 5 seconds shutdown_timeout=$(( $NOW + ($OCF_RESKEY_CRM_meta_timeout/1000) -5 )) # Loop on status until we reach $shutdown_timeout while [ $NOW -lt $shutdown_timeout ]; do Doesn't $OCF_RESKEY_CRM_meta_timeout correspond to the timeout value in the op stop ... line? It should, however there was a bug in 1.1.6 where this wasn't the case. The relevant patch is: https://github.com/beekhof/pacemaker/commit/fcfe6fe Or you could try 1.1.7 How can I optimize my pacemaker configuration so that the VM will attempt to gracefully shutdown and then at worst case destroy the VM before the pacemaker timeout is reached? Moreover, is there anything I can do inside of the VM (another Ubuntu 10.04 install) to optimize/speed up the shutdown process? Thanks, Andrew ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
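For reference, the deadline arithmetic the RA code quoted above performs amounts to the following sketch, assuming OCF_RESKEY_CRM_meta_timeout carries the stop-op timeout in milliseconds (which, per Andrew's reply, is only true once the referenced patch is applied or on 1.1.7). This is not the verbatim RA code, just the shape of it:

# sketch of VirtualDomain's stop-side deadline handling; DOMAIN_NAME comes from the RA's environment
NOW=$(date +%s)
stop_timeout_ms=${OCF_RESKEY_CRM_meta_timeout:-120000}    # stop-op timeout from the CIB, in milliseconds
deadline=$(( NOW + stop_timeout_ms / 1000 - 5 ))          # leave ~5s of headroom before Pacemaker gives up
virsh shutdown "$DOMAIN_NAME"                             # ask the guest to shut down gracefully
while [ "$(date +%s)" -lt "$deadline" ]; do
    if virsh domstate "$DOMAIN_NAME" 2>/dev/null | grep -q "shut off"; then
        break                                             # guest stopped within the window
    fi
    sleep 1
done
virsh domstate "$DOMAIN_NAME" 2>/dev/null | grep -q "shut off" || virsh destroy "$DOMAIN_NAME"

So with a correctly propagated 120s stop timeout, the hard destroy is issued about 5 seconds before Pacemaker's own stop timeout expires, which is the behaviour Andrew Martin expected in step 3.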
Re: [Pacemaker] unable to join cluster
On Thu, Mar 22, 2012 at 3:07 PM, Hisashi Osanai osanai.hisa...@jp.fujitsu.com wrote: Hello, I have three nodes cluster using pacemaker/corosync. When I reboot one node, the node unable to join cluster. I can see that kind of split brain 10-20% (recall ration) if I shutdown a node. What do you think of this problem? It depends whether corosync sees all three nodes (in which case its a pacemaker problem), if not its a corosync problem. There are newer versions of both, perhaps try an upgrade? My questions are: - Is this known problem? - Any work around to avoid the this? - How can I solve this problem? [testserver001] Last updated: Sat Mar 10 14:18:49 2012 Stack: openais Current DC: NONE 3 Nodes configured, 3 expected votes 4 Resources configured. OFFLINE: [ testserver001 testserver002 testserver003 ] Migration summary: [testserver002] Last updated: Sat Mar 10 14:15:17 2012 Stack: openais Current DC: testserver002 - partition with quorum Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3 3 Nodes configured, 3 expected votes 4 Resources configured. Online: [ testserver002 testserver003 ] OFFLINE: [ testserver001 ] Resource Group: testgroup testrsc (lsb:testmgr): Started testserver002 stonith-testserver002 (stonith:external/ipmi): Started testserver003 stonith-testserver003 (stonith:external/ipmi): Started testserver002 stonith-testserver001 (stonith:external/ipmi): Started testserver003 Migration summary: * Node testserver003: * Node testserver002: [testserver003] Last updated: Sat Mar 10 14:19:07 2012 Stack: openais Current DC: testserver002 - partition with quorum Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3 3 Nodes configured, 3 expected votes 4 Resources configured. Online: [ testserver002 testserver003 ] OFFLINE: [ testserver001 ] Resource Group: testgroup testrsc (lsb:testmgr): Started testserver002 stonith-testserver002 (stonith:external/ipmi): Started testserver003 stonith-testserver003 (stonith:external/ipmi): Started testserver002 stonith-testserver001 (stonith:external/ipmi): Started testserver003 Migration summary: * Node testserver003: * Node testserver002: - Checked information + https://bugzilla.redhat.com/show_bug.cgi?id=525589 It looks the packages which I used already support this. + http://comments.gmane.org/gmane.linux.highavailability.user/36101 I checked entries in /etc/hosts but I didn't find out the wrong entry. === 127.0.0.1 testserver001 localhost ::1 localhost6.localdomain6 localhost6 === - Look into this from tcpdump OK case: after MESSAGE_TYPE_ORF_TOKEN received, pacemaker sends MESSAGE_TYPE_MCAST. I took the information from VMware env. + MESSAGE_TYPE_ORF_TOKEN No. Time Source Destination Protocol Length Info 119 2012-03-19 22:00:15.250310 172.27.4.1 172.27.4.2 UDP 112 Source port: 23489 Destination port: 23490 Frame 119: 112 bytes on wire (896 bits), 112 bytes captured (896 bits) Ethernet II, Src: Vmware_6b:b9:9a (00:0c:29:6b:b9:9a), Dst: Vmware_8e:74:92 (00:0c:29:8e:74:92) Internet Protocol Version 4, Src: 172.27.4.1 (172.27.4.1), Dst: 172.27.4.2 (172.27.4.2) User Datagram Protocol, Src Port: 23489 (23489), Dst Port: 23490 (23490) Data (70 bytes) 00 00 22 ff ac 1b 04 01 00 00 00 00 0c 00 00 00 ... 0010 00 00 00 00 00 00 00 00 ac 1b 04 01 02 00 ac 1b (snip) + MESSAGE_TYPE_MCAST No. 
Time Source Destination Protocol Length Info 5141 2012-03-19 22:01:19.198346 172.27.4.2 226.94.16.16 UDP 1486 Source port: 23489 Destination port: 23490 Frame 5141: 1486 bytes on wire (11888 bits), 1486 bytes captured (11888 bits) Ethernet II, Src: Vmware_8e:74:92 (00:0c:29:8e:74:92), Dst: IPv4mcast_5e:10:10 (01:00:5e:5e:10:10) Internet Protocol Version 4, Src: 172.27.4.2 (172.27.4.2), Dst: 226.94.16.16 (226.94.16.16) User Datagram Protocol, Src Port: 23489 (23489), Dst Port: 23490 (23490) Data (1444 bytes) 01 02 22 ff ac 1b 04 02 ac 1b 04 02 02 00 ac 1b ... 0010 04 02 08 00 02 00 ac 1b 04 02 08 00 04 00 ac 1b (snip) NG case: MESSAGE_TYPE_ORF_TOKEN sent and received repeatedly and I can see the message in pacemaker.log. + MESSAGE_TYPE_ORF_TOKEN No. Time Source Destination Protocol Length Info 39605 2012-03-10 14:18:13.826778 172.27.4.2 172.27.4.3 UDP 112 Source port: 23489 Destination port: 23490 Frame 39605: 112 bytes
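As Andrew notes, the first thing to establish is whether corosync itself sees all three nodes, which separates a corosync problem from a pacemaker one. A couple of quick checks to run on each node (sketch; exact tool availability depends on the corosync version installed):

corosync-cfgtool -s                 # ring/interface status as corosync sees it on this node
crm_node -l                         # cluster membership as seen by the pacemaker stack
corosync-objctl | grep -i member    # corosync 1.x: dump runtime membership objects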
Re: [Pacemaker] [Problem] The cluster fails in the stop of the node.
This appears to be resolved with 1.1.7, perhaps look for a patch to backport? On Tue, Mar 27, 2012 at 4:46 PM, renayama19661...@ybb.ne.jp wrote: Hi All, When we set a group resource within Master/Slave resource, we found the problem that a node could not stop. This problem occurs in Pacemaker1.0.11. We confirmed a problem in the following procedure. Step1) Start all nodes. Last updated: Tue Mar 27 14:35:16 2012 Stack: Heartbeat Current DC: test2 (b645c456-af78-429e-a40a-279ed063b97d) - partition WITHOUT quorum Version: 1.0.12-unknown 2 Nodes configured, unknown expected votes 4 Resources configured. Online: [ test1 test2 ] Master/Slave Set: msGroup01 Masters: [ test1 ] Slaves: [ test2 ] Resource Group: testGroup prmDummy1 (ocf::pacemaker:Dummy): Started test1 prmDummy2 (ocf::pacemaker:Dummy): Started test1 Resource Group: grpStonith1 prmStonithN1 (stonith:external/ssh): Started test2 Resource Group: grpStonith2 prmStonithN2 (stonith:external/ssh): Started test1 Migration summary: * Node test2: * Node test1: Step2) Stop Slave node. [root@test2 ~]# service heartbeat stop Stopping High-Availability services: Done. Step3) Stop Master node. However, a loop does the Master node and does not stop. (snip) Mar 27 14:38:06 test1 crmd: [21443]: WARN: run_graph: Transition 3 (Complete=7, Pending=0, Fired=0, Skipped=0, Incomplete=23, Source=/var/lib/pengine/pe-input-3.bz2): Terminated Mar 27 14:38:06 test1 crmd: [21443]: ERROR: te_graph_trigger: Transition failed: terminated Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_graph: Graph 3 (30 actions in 30 synapses): batch-limit=30 jobs, network-delay=6ms Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_graph: Synapse 0 is pending (priority: 0) Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_elem: [Action 12]: Pending (id: testMsGroup01:0_stop_0, type: pseduo, priority: 0) Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_elem: * [Input 14]: Completed (id: testMsGroup01:0_demote_0, type: pseduo, priority: 0) Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_elem: * [Input 32]: Pending (id: msGroup01_stop_0, type: pseduo, priority: 0) Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_graph: Synapse 1 is pending (priority: 0) Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_elem: [Action 13]: Pending (id: testMsGroup01:0_stopped_0, type: pseduo, priority: 0) Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_elem: * [Input 8]: Pending (id: prmStateful1:0_stop_0, loc: test1, priority: 0) Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_elem: * [Input 9]: Pending (id: prmStateful2:0_stop_0, loc: test1, priority: 0) Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_elem: * [Input 12]: Pending (id: testMsGroup01:0_stop_0, type: pseduo, priority: 0) Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_graph: Synapse 2 was confirmed (priority: 0) (snip) I attach data of hb_report. Best Regards, Hideo Yamauchi. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
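For readers trying to reproduce this, the configuration under discussion is a group wrapped inside a master/slave resource. Reconstructed from the resource names in the logs (the authoritative configuration is in the attached hb_report; the Stateful agent and meta attributes below are assumptions), it would look roughly like:

primitive prmStateful1 ocf:pacemaker:Stateful
primitive prmStateful2 ocf:pacemaker:Stateful
group testMsGroup01 prmStateful1 prmStateful2
ms msGroup01 testMsGroup01 \
    meta master-max=1 clone-max=2 clone-node-max=1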
Re: [Pacemaker] Question about Pacemaker mysql master/slave replication and DRBD replication
On Thu, Mar 15, 2012 at 11:09 PM, coma coma@gmail.com wrote: Hello, I'm a new Pacemaker user and i trying to understand exactly what it can do / can't do in case of MySQL Replication or DRBD replication. I have two MySQl servers, for the moment with a simple Master/Slave replication, my need is to implement a high availability system with automated IP and DB fail-over / fail-back (including replication). I would like to be able to have 1 node designated as Master and in case of failure, automatically promote the slave node to master, and when the initial master will available again do the reverse operation automatically. I have compared several solutions and according to my needs (i have two servers only and i don't want / don't need use solutions like MySQL Cluster which requires 4 servers), Pacemaker seems the best solution in my case. I have to choose between Pacemaker with MySQL replication or Pacemaker with DRBD replication but it's difficult to find clear explanations and documentation about the 2 solutions, so i have some questions about it. If anyone can enlighten me i thank in advance! In the Pacemaker + MySQL replication case: I know pacemaker is able to do IP failover and it seems DB failover too(?), but what type of failure pacemaker can detect? I know it is able to monitor the network failure (node unreachable), but can it detect MySQL service failure and promote the slave to master? Yes. We call the mysql resource agent and react to any failures it detects. Example: Master node reachable, but database not (mysql service stopped, access failed-too many connexions, disk full, database read access but write error...)? Can pacemaker do (natively) the reverse operation automatically (When initial master node or MySQL DB will be available again)? Yes. This is what the http://www.mysqlperformanceblog.com blog is talking about. Pacemaker doesn't actually understand what it is managing, the details are hidden by the RA scripts. In this case, can it manage the replication? and if not, can i use a personal shell scipt to do it? Else, i have browsing the maillinglist archives and i've seen the Percona Replication Manager solution (http://www.mysqlperformanceblog.com/2011/11/29/percona-replication-manager-a-solution-for-mysql-high-availability-with-replication-using-pacemaker/), somebody he already implemented it in production environment? Is it reliable? I've not used it myself but I hear good things. In the Pacemaker + DRBD replication case: I understand that pacemaker and drbd work very well together and drbd is a good solution for mysql replication. In case of master (active node) failure, pacemaker+DRBD promote automatically the slave (passive node) as master and i have read that drbd can handle itself the back replication when master node is available again? can you enlighten me a little more about it? I also read that it is recommended to have a dedicated network for drbd replication, but can i do without? I don't write a lot on my databases, reading much more, so replication will not be a big charge. The big problem i have with DRBD is that i work on RHEL5 and i have read that i will have to recompile DRBD after each kernel update (there is not a lot of updates but still some), is it possible to avoid it? (CentOS DRBD packages maybe?) Somebody has already been problems with RHEL updates and DRBD/Pacemaker? RHEL5 is getting on a bit, its version of glib is too old to even build pacemaker there anymore. What about using RHEL6 which even ships with pacemaker? 
Thank you in advance for any response, and sorry if my English is not very good. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
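For the Pacemaker + MySQL replication route being asked about, the usual building block is the ocf:heartbeat:mysql agent run as a master/slave resource, with a floating IP colocated with the master role. A minimal sketch in crm shell syntax (paths, credentials, the replication user and the IP address are placeholders, not values from this thread):

primitive p_mysql ocf:heartbeat:mysql \
    params binary="/usr/sbin/mysqld" config="/etc/mysql/my.cnf" \
        replication_user="repl" replication_passwd="secret" \
    op monitor interval="20s" role="Slave" timeout="30s" \
    op monitor interval="10s" role="Master" timeout="30s"
ms ms_mysql p_mysql \
    meta master-max="1" clone-max="2" notify="true"
primitive p_ip ocf:heartbeat:IPaddr2 \
    params ip="192.168.1.10" cidr_netmask="24"
colocation ip-with-mysql-master inf: p_ip ms_mysql:Master
order mysql-promote-before-ip inf: ms_mysql:promote p_ip:start

The Percona Replication Manager mentioned in the thread layers its own replication handling on top of essentially this pattern; the DRBD alternative replaces the mysql master/slave pair with a DRBD master/slave resource plus a Filesystem and a single mysqld.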
Re: [Pacemaker] [PATCH] pingd checks pidfile on start
Any chance you could redo this as a github pull request? :-D On Wed, Mar 14, 2012 at 6:49 PM, Takatoshi MATSUO matsuo@gmail.com wrote: Hi I use pacemaker 1.0.11 and the pingd RA. Occasionally, pingd's first monitor fails after start. It seems that the main cause is that the pingd daemon returns 0 before creating its pidfile, and the RA doesn't check the pidfile on start. test script - while true; do killall pingd; sleep 3; rm -f /tmp/pingd.pid; sleep 1; /usr/lib64/heartbeat/pingd -D -p /tmp/pingd.pid -a ping_status -d 0 -m 100 -h 192.168.0.1; echo $?; ls /tmp/pingd.pid; sleep .1; ls /tmp/pingd.pid; done - result - 0 /tmp/pingd.pid /tmp/pingd.pid 0 ls: cannot access /tmp/pingd.pid: No such file or directory - NG /tmp/pingd.pid 0 /tmp/pingd.pid /tmp/pingd.pid 0 /tmp/pingd.pid /tmp/pingd.pid 0 /tmp/pingd.pid /tmp/pingd.pid 0 ls: cannot access /tmp/pingd.pid: No such file or directory - NG /tmp/pingd.pid -- Please consider the attached patch for pacemaker-1.0. Regards, Takatoshi MATSUO ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
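The attached patch itself is not reproduced in the archive, but the fix being proposed is presumably along these lines on the RA's start path: wait for the daemon to actually write its pidfile before reporting success. A hedged sketch (uses the ocf-shellfuncs helpers normally available inside an RA; names are illustrative, not the patch's):

# wait for the pingd daemon to create its pidfile before the start action returns
wait_for_pidfile() {
    local pidfile="$1"
    local tries=10
    while [ "$tries" -gt 0 ]; do
        [ -s "$pidfile" ] && return 0      # daemon has written its pid; start can succeed
        sleep 1
        tries=$(( tries - 1 ))
    done
    ocf_log err "pingd did not create $pidfile in time"
    return 1
}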
Re: [Pacemaker] problem: stonith executes stonith command remote on the dead host and not local
On Tue, Mar 6, 2012 at 5:12 AM, Thomas Boernert t...@tbits.net wrote: Hi again, should i open a bug report about this issue ? Which distro is this? Any chance to try 1.1.7? Thanks Thomas Thomas Börnert schrieb am 02.03.2012 um 12:06 Uhr Hi List, my problem is that stonith will execute the command to fence on the remote dead host and not on the local machine :-(. this will end with an timeout. some facts: - 2 node cluster with 2 dell servers - each server have an own drac card - pacemaker 1.1.6 - heartbeat 3.0.4 - corosync 1.4.1 node1 should fence node2 if node2 is dead and node2 should fence node1 if node1 is dead it works fine manual with the stonith script fence_drac5 my config -- snip node node1 \ attributes standby=off node node2 \ attributes standby=off primitive httpd ocf:heartbeat:apache \ params configfile=/etc/httpd/conf/httpd.conf port=80 \ op start interval=0 timeout=60s \ op monitor interval=5s timeout=20s \ op stop interval=0 timeout=60s primitive node1-stonith stonith:fence_drac5 \ params ipaddr=192.168.1.101 login=root passwd=1234 action=reboot secure=true cmd_prompt=admin1- power_wait=300 pcmk_host_list=node1 primitive node2-stonith stonith:fence_drac5 \ params ipaddr=192.168.1.102 login=root passwd=1234 action=reboot secure=true cmd_prompt=admin1- power_wait=300 pcmk_host_list=node2 primitive nodeIP ocf:heartbeat:IPaddr2 \ op monitor interval=60 timeout=20 \ params ip=192.168.1.10 cidr_netmask=24 nic=eth0:0 broadcast=192.168.1.255 primitive nodeIParp ocf:heartbeat:SendArp \ params ip=192.168.1.10 nic=eth0:0 group WebServices nodeIP nodeIParp httpd location node1-stonith-log node1-stonith -inf: node1 location node2-stonith-log node2-stonith -inf: node2 property $id=cib-bootstrap-options \ dc-version=1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558 \ cluster-infrastructure=openais \ expected-quorum-votes=2 \ stonith-enabled=true \ no-quorum-policy=ignore \ last-lrm-refresh=1330685786 -- snip [root@node2 ~]# stonith_admin -l node1 node1-stonith 1 devices found it seems ok now i try [root@node2 ~]# stonith_admin -V -F node1 stonith_admin[5685]: 2012/03/02_13:00:44 debug: main: Create stonith_admin[5685]: 2012/03/02_13:00:44 debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/st_command stonith_admin[5685]: 2012/03/02_13:00:44 debug: get_stonith_token: Obtained registration token: 6258828b-4b19-472f-9256-8da36fe87962 stonith_admin[5685]: 2012/03/02_13:00:44 debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/st_callback stonith_admin[5685]: 2012/03/02_13:00:44 debug: get_stonith_token: Obtained registration token: 6266ebb8-2112-4378-a00c-3eaff47c9a9d stonith_admin[5685]: 2012/03/02_13:00:44 debug: stonith_api_signon: Connection to STONITH successful stonith_admin[5685]: 2012/03/02_13:00:44 debug: main: Connect: 0 Command failed: Operation timed out stonith_admin[5685]: 2012/03/02_13:00:56 debug: stonith_api_signoff: Signing out of the STONITH Service stonith_admin[5685]: 2012/03/02_13:00:56 debug: main: Disconnect: -8 stonith_admin[5685]: 2012/03/02_13:00:56 debug: main: Destroy the log on node2 shows: --- snip --- Mar 2 13:00:58 node2 crmd: [2665]: info: te_fence_node: Executing reboot fencing operation (21) on node1 (timeout=6) Mar 2 13:00:58 node2 stonith-ng: [2660]: info: initiate_remote_stonith_op: Initiating remote operation reboot for node1: 3325df94-8d59-4c00-a37e-be31e79f7503 Mar 2 13:00:58 node2 stonith-ng: [2638]: info: stonith_command: Processed st_query from node2: rc=0 --- snip --- why remote on the dead host ? 
Thanks Thomas the complete log --- snip --- Mar 2 13:00:44 node2 stonith_admin: [5685]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/root Mar 2 13:00:44 node2 stonith-ng: [2660]: info: initiate_remote_stonith_op: Initiating remote operation off for node1: 7d8beca4-1853-44fd-9bb2-4015b080c37b Mar 2 13:00:44 node2 stonith-ng: [2638]: info: stonith_command: Processed st_query from node2: rc=0 Mar 2 13:00:46 node2 stonith-ng: [2660]: ERROR: remote_op_query_timeout: Query 561e89af-6f5a-45cb-adc2-45389940f1db for node1 timed out Mar 2 13:00:46 node2 stonith-ng: [2660]: ERROR: remote_op_timeout: Action reboot (561e89af-6f5a-45cb-adc2-45389940f1db) for node1 timed out Mar 2 13:00:46 node2 stonith-ng: [2660]: info: remote_op_done: Notifing clients of 561e89af-6f5a-45cb-adc2-45389940f1db (reboot of
Re: [Pacemaker] how to using rules to control resource option?
try reading Clusters from Scratch 2012/3/26 sinchb sin...@163.com: How do I write the command when I want to use rules to control a resource option (like resource-stickiness)? How do I add a rule and a score to the instance attribute? I am looking forward to your reply! ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
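The classic worked example for rule-controlled options (see also the Rules chapter of Pacemaker Explained) switches resource-stickiness depending on the time of day. Rules attach to an attribute set, each set carries a score, and the highest-scoring set whose rule matches wins. Shown here in CIB XML because the crm shell syntax for rules varies between crmsh versions; the ids and the 9-to-5 date_spec are placeholders:

<rsc_defaults>
  <meta_attributes id="core-hours-stickiness" score="2">
    <rule id="core-hours-rule" score="0">
      <date_expression id="core-hours-expr" operation="date_spec">
        <date_spec id="core-hours-spec" hours="9-16" weekdays="1-5"/>
      </date_expression>
    </rule>
    <nvpair id="core-hours-value" name="resource-stickiness" value="INFINITY"/>
  </meta_attributes>
  <meta_attributes id="default-stickiness" score="1">
    <nvpair id="default-stickiness-value" name="resource-stickiness" value="0"/>
  </meta_attributes>
</rsc_defaults>

A fragment like this can be applied with, for example, cibadmin --replace --obj_type rsc_defaults --xml-file stickiness.xml; the same rule/nvpair structure works inside a primitive's instance_attributes if the goal is to vary a resource parameter rather than a default.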
[Pacemaker] Migration of lower resource causes dependent resources to restart
Hi Andrew, all, Pacemaker restarts resources when a resource they depend on (ordering only, no colocation) is migrated. I mean that when I do crm resource migrate lustre, I get LogActions: Migrate lustre#011(Started lustre03-left - lustre04-left) LogActions: Restart mgs#011(Started lustre01-left) I only have one ordering constraint for these two resources: order mgs-after-lustre inf: lustre:start mgs:start This reminds me of what used to happen with reload (dependent resources restarting when the lower resource was reloaded). Shouldn't this be changed? Migration usually means that the service is not interrupted... Best, Vladislav ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
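One knob relevant to this discussion is the ordering score: the inf: constraint above is mandatory, so any stop/start of lustre drags mgs with it, whereas an advisory (score 0) constraint only influences ordering when both actions are being executed anyway and does not by itself force a restart of mgs. A sketch in crm shell syntax; whether it is acceptable here depends on whether mgs can tolerate lustre briefly changing nodes without being restarted:

order mgs-after-lustre 0: lustre:start mgs:start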
Re: [Pacemaker] Migration of lower resource causes dependent resources to restart
On Thu, Mar 29, 2012 at 5:28 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote: Hi Andrew, all, Pacemaker restarts resources when a resource they depend on (ordering only, no colocation) is migrated. I mean that when I do crm resource migrate lustre, I get LogActions: Migrate lustre#011(Started lustre03-left - lustre04-left) LogActions: Restart mgs#011(Started lustre01-left) I only have one ordering constraint for these two resources: order mgs-after-lustre inf: lustre:start mgs:start This reminds me of what used to happen with reload (dependent resources restarting when the lower resource was reloaded). Shouldn't this be changed? Migration usually means that the service is not interrupted... Is that strictly true? Always? My understanding was that although A thinks the migration happens instantaneously, it is in fact more likely to be pause+migrate+resume, and anyone trying to talk to A during that window is going to be disappointed. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Resource Agent ethmonitor
On 21/03/2012 09:06, Florian Haas wrote: On Tue, Mar 20, 2012 at 4:18 PM, Fiorenza Meini fme...@esseweb.eu wrote: Hi there, has anybody successfully configured the RA named in the subject of this message? I got this error: if_eth0_monitor_0 (node=fw1, call=2297, rc=-2, status=Timed Out): unknown exec error Your ethmonitor RA missed its 50-second timeout on the probe (that is, the initial monitor operation). You should be seeing Monitoring of if_eth0 failed, X retries left warnings in your logs. Grepping your syslog for ethmonitor will probably turn up some useful results. Cheers, Florian Thank you, I solved the problem. Regards -- Fiorenza Meini Spazio Web S.r.l. V. Dante Alighieri, 10 - 13900 Biella Tel.: 015.2431982 - 015.9526066 Fax: 015.2522600 Reg. Imprese, CF e P.I.: 02414430021 Iscr. REA: BI - 188936 Iscr. CCIAA: Biella - 188936 Cap. Soc.: 30.000,00 Euro i.v. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
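For anyone hitting the same probe timeout: ethmonitor is normally run as a clone with generous operation timeouts, since the probe can take a while on a busy interface. A minimal sketch (interface name and timeout values are placeholders, not taken from this thread):

primitive if_eth0 ocf:heartbeat:ethmonitor \
    params interface="eth0" \
    op start interval="0" timeout="60s" \
    op monitor interval="10s" timeout="60s"
clone cl_if_eth0 if_eth0 meta globally-unique="false"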
Re: [Pacemaker] Migration of lower resource causes dependent resources to restart
29.03.2012 09:35, Andrew Beekhof wrote: On Thu, Mar 29, 2012 at 5:28 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote: Hi Andrew, all, Pacemaker restarts resources when resource they depend on (ordering only, no colocation) is migrated. I mean that when I do crm resource migrate lustre, I get LogActions: Migrate lustre#011(Started lustre03-left - lustre04-left) LogActions: Restart mgs#011(Started lustre01-left) I only have one ordering constraint for these two resources: order mgs-after-lustre inf: lustre:start mgs:start This reminds me what have been with reload in a past (dependent resource restart when lower resource is reloaded). Shouldn't this be changed? Migration usually means that service is not interrupted... Is that strictly true? Always? This probably depends on implementation. With qemu live migration - yes. With pacemaker:Dummy (with meta allow-migrate=true) probably yes too... My understanding was although A thinks the migration happens instantaneously, it is in fact more likely to be pause+migrate+resume and during that time anyone trying to talk to A during that time is going to be disappointed. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Migration of lower resource causes dependent resources to restart
29.03.2012 09:43, Vladislav Bogdanov wrote: 29.03.2012 09:35, Andrew Beekhof wrote: On Thu, Mar 29, 2012 at 5:28 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote: Hi Andrew, all, Pacemaker restarts resources when resource they depend on (ordering only, no colocation) is migrated. I mean that when I do crm resource migrate lustre, I get LogActions: Migrate lustre#011(Started lustre03-left - lustre04-left) LogActions: Restart mgs#011(Started lustre01-left) I only have one ordering constraint for these two resources: order mgs-after-lustre inf: lustre:start mgs:start This reminds me what have been with reload in a past (dependent resource restart when lower resource is reloaded). Shouldn't this be changed? Migration usually means that service is not interrupted... Is that strictly true? Always? This probably depends on implementation. With qemu live migration - yes. With pacemaker:Dummy (with meta allow-migrate=true) probably yes too... And if RA is just a management for some external entity - then yes too (although this case is probably not very common ;) ). My understanding was although A thinks the migration happens instantaneously, it is in fact more likely to be pause+migrate+resume and during that time anyone trying to talk to A during that time is going to be disappointed. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Migration of lower resource causes dependent resources to restart
On Thu, Mar 29, 2012 at 5:43 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote: 29.03.2012 09:35, Andrew Beekhof wrote: On Thu, Mar 29, 2012 at 5:28 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote: Hi Andrew, all, Pacemaker restarts resources when resource they depend on (ordering only, no colocation) is migrated. I mean that when I do crm resource migrate lustre, I get LogActions: Migrate lustre#011(Started lustre03-left - lustre04-left) LogActions: Restart mgs#011(Started lustre01-left) I only have one ordering constraint for these two resources: order mgs-after-lustre inf: lustre:start mgs:start This reminds me what have been with reload in a past (dependent resource restart when lower resource is reloaded). Shouldn't this be changed? Migration usually means that service is not interrupted... Is that strictly true? Always? This probably depends on implementation. With qemu live migration - yes. So there will be no point at which, for example, pinging the VM's ip address fails? With pacemaker:Dummy (with meta allow-migrate=true) probably yes too... My understanding was although A thinks the migration happens instantaneously, it is in fact more likely to be pause+migrate+resume and during that time anyone trying to talk to A during that time is going to be disappointed. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Migration of lower resource causes dependent resources to restart
29.03.2012 10:07, Andrew Beekhof wrote: On Thu, Mar 29, 2012 at 5:43 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote: 29.03.2012 09:35, Andrew Beekhof wrote: On Thu, Mar 29, 2012 at 5:28 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote: Hi Andrew, all, Pacemaker restarts resources when resource they depend on (ordering only, no colocation) is migrated. I mean that when I do crm resource migrate lustre, I get LogActions: Migrate lustre#011(Started lustre03-left - lustre04-left) LogActions: Restart mgs#011(Started lustre01-left) I only have one ordering constraint for these two resources: order mgs-after-lustre inf: lustre:start mgs:start This reminds me what have been with reload in a past (dependent resource restart when lower resource is reloaded). Shouldn't this be changed? Migration usually means that service is not interrupted... Is that strictly true? Always? This probably depends on implementation. With qemu live migration - yes. So there will be no point at which, for example, pinging the VM's ip address fails? Even all existing connections are preserved. Small delays during last migration phase are still possible, but they are minor (during around 100-200 milliseconds while context is switching and ip is announced from another node). And packets are not lost, just delayed a bit. I have corosync/pacemaker udpu clusters in VMs, and even corosync is happy when VM it runs on is migrating to another node (with some token tuning). With pacemaker:Dummy (with meta allow-migrate=true) probably yes too... My understanding was although A thinks the migration happens instantaneously, it is in fact more likely to be pause+migrate+resume and during that time anyone trying to talk to A during that time is going to be disappointed. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
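As background to this exchange: Pacemaker only attempts live migration for resources whose agent implements it and which carry the allow-migrate meta attribute; the work itself is done by the agent's migrate_to and migrate_from actions. A sketch for a KVM guest managed by ocf:heartbeat:VirtualDomain (config path, hypervisor URI and timeouts are placeholders):

primitive p_vm ocf:heartbeat:VirtualDomain \
    params config="/etc/libvirt/qemu/guest.xml" hypervisor="qemu:///system" \
    meta allow-migrate="true" \
    op migrate_to interval="0" timeout="120s" \
    op migrate_from interval="0" timeout="60s" \
    op monitor interval="10s" timeout="30s"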
[Pacemaker] CIB not saved
Hi there, a strange thing happened to my two-node cluster: I rebooted both machines at the same time, and when the OS came up again, no resources were configured anymore, as if it were a fresh installation. Why? It was explained to me that the configuration of resources managed by pacemaker should be in a file called cib.xml, but I cannot find it on the system. Do I have to specify any particular option in the configuration file? Thanks and regards -- Fiorenza Meini Spazio Web S.r.l. V. Dante Alighieri, 10 - 13900 Biella Tel.: 015.2431982 - 015.9526066 Fax: 015.2522600 Reg. Imprese, CF e P.I.: 02414430021 Iscr. REA: BI - 188936 Iscr. CCIAA: Biella - 188936 Cap. Soc.: 30.000,00 Euro i.v. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
[Pacemaker] Issue with ordering
Hi Andrew, all, I'm continuing experiments with lustre on stacked drbd, and see following problem: I have one drbd resource (ms-drbd-testfs-mdt) is stacked on top of other (ms-drbd-testfs-mdt-left), and have following constraints between them: colocation drbd-testfs-mdt-with-drbd-testfs-mdt-left inf: ms-drbd-testfs-mdt ms-drbd-testfs-mdt-left:Master order drbd-testfs-mdt-after-drbd-testfs-mdt-left inf: ms-drbd-testfs-mdt-left:promote ms-drbd-testfs-mdt:start Then I have filesystem mounted on top of ms-drbd-testfs-mdt (testfs-mdt resource). colocation testfs-mdt-with-drbd-testfs-mdt inf: testfs-mdt ms-drbd-testfs-mdt:Master order testfs-mdt-after-drbd-testfs-mdt inf: ms-drbd-testfs-mdt:promote testfs-mdt:start When I trigger event which causes many resources to stop (including these three), LogActions output look like: LogActions: Stopdrbd-local#011(lustre01-left) LogActions: Stopdrbd-stacked#011(Started lustre02-left) LogActions: Stopdrbd-testfs-local#011(Started lustre03-left) LogActions: Stopdrbd-testfs-stacked#011(Started lustre04-left) LogActions: Stoplustre#011(Started lustre04-left) LogActions: Stopmgs#011(Started lustre01-left) LogActions: Stoptestfs#011(Started lustre03-left) LogActions: Stoptestfs-mdt#011(Started lustre01-left) LogActions: Stoptestfs-ost#011(Started lustre01-left) LogActions: Stoptestfs-ost0001#011(Started lustre02-left) LogActions: Stoptestfs-ost0002#011(Started lustre03-left) LogActions: Stoptestfs-ost0003#011(Started lustre04-left) LogActions: Stopdrbd-mgs:0#011(Master lustre01-left) LogActions: Stopdrbd-mgs:1#011(Slave lustre02-left) LogActions: Stopdrbd-testfs-mdt:0#011(Master lustre01-left) LogActions: Stopdrbd-testfs-mdt-left:0#011(Master lustre01-left) LogActions: Stopdrbd-testfs-mdt-left:1#011(Slave lustre02-left) LogActions: Stopdrbd-testfs-ost:0#011(Master lustre01-left) LogActions: Stopdrbd-testfs-ost-left:0#011(Master lustre01-left) LogActions: Stopdrbd-testfs-ost-left:1#011(Slave lustre02-left) LogActions: Stopdrbd-testfs-ost0001:0#011(Master lustre02-left) LogActions: Stopdrbd-testfs-ost0001-left:0#011(Master lustre02-left) LogActions: Stopdrbd-testfs-ost0001-left:1#011(Slave lustre01-left) LogActions: Stopdrbd-testfs-ost0002:0#011(Master lustre03-left) LogActions: Stopdrbd-testfs-ost0002-left:0#011(Master lustre03-left) LogActions: Stopdrbd-testfs-ost0002-left:1#011(Slave lustre04-left) LogActions: Stopdrbd-testfs-ost0003:0#011(Master lustre04-left) LogActions: Stopdrbd-testfs-ost0003-left:0#011(Master lustre04-left) LogActions: Stopdrbd-testfs-ost0003-left:1#011(Slave lustre03-left) For some reason demote is not run on both mdt drbd esources (should it?), so drbd RA prints warning about that. What I see then is that ms-drbd-testfs-mdt-left is tried to stop before ms-drbd-testfs-mdt. More, testfs-mdt filesystem resource is not stopped before stopping drbd-testfs-mdt. I have advisory ordering constraints between mdt and ost filesystem resources, so all ost's are stopped before mdt. Thus mdt stop is delayed a bit. May be this influences what happens. I'm pretty sure I have correct constraints for at least these three resources, so it looks like a bug, because mandatory ordering is not preserved. I can produce report for this. Best, Vladislav ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] CIB not saved
On Thu, Mar 29, 2012 at 9:54 AM, Fiorenza Meini fme...@esseweb.eu wrote: Hi there, a strange thing happened to my two-node cluster: I rebooted both machines at the same time, and when the OS came up again, no resources were configured anymore, as if it were a fresh installation. Why? It was explained to me that the configuration of resources managed by pacemaker should be in a file called cib.xml, but I cannot find it on the system. Do I have to specify any particular option in the configuration file? Normally you shouldn't worry about it. cib.xml is stored in /var/lib/heartbeat/crm/ or similar and the directory should have hacluster:haclient permissions. What distro is it and how did you install it? Rasto -- Dipl.-Ing. Rastislav Levrinc rasto.levr...@gmail.com Linux Cluster Management Console http://lcmc.sf.net/ ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
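A quick way to verify what Rasto describes; the path below assumes a heartbeat-based install and may differ on other distros or builds (e.g. /var/lib/pacemaker/cib on newer packages):

ls -ld /var/lib/heartbeat/crm             # directory should be writable by the cluster user
ls -l /var/lib/heartbeat/crm/cib.xml*     # cib.xml plus its .sig/.last companions, if present
# typical expected ownership and permissions
chown -R hacluster:haclient /var/lib/heartbeat/crm
chmod 750 /var/lib/heartbeat/crm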
Re: [Pacemaker] Issue with ordering
On Thu, Mar 29, 2012 at 10:07 AM, Vladislav Bogdanov bub...@hoster-ok.com wrote: Hi Andrew, all, I'm continuing experiments with lustre on stacked drbd, and see following problem: At the risk of going off topic, can you explain *why* you want to do this? If you need a distributed, replicated filesystem with asynchronous replication capability (the latter presumably for DR), why not use a Distributed-Replicated GlusterFS volume with geo-replication? Note that I know next to nothing about your actual detailed requirements, so GlusterFS may well be non-ideal for you and my suggestion may thus be moot, but it would be nice if you could explain why you're doing this. Cheers, Florian -- Need help with High Availability? http://www.hastexo.com/now ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Issue with ordering
Hi Florian, 29.03.2012 11:54, Florian Haas wrote: On Thu, Mar 29, 2012 at 10:07 AM, Vladislav Bogdanov bub...@hoster-ok.com wrote: Hi Andrew, all, I'm continuing experiments with lustre on stacked drbd, and see following problem: At the risk of going off topic, can you explain *why* you want to do this? If you need a distributed, replicated filesystem with asynchronous replication capability (the latter presumably for DR), why not use a Distributed-Replicated GlusterFS volume with geo-replication? I need fast POSIX fs scalable to tens of petabytes with support for fallocate() and friends to prevent fragmentation. I generally agree with Linus about FUSE and userspace filesystems in general, so that is not an option. Using any API except what VFS provides via syscalls+glibc is not an option too because I need access to files from various scripted languages including shell and directly from a web server written in C. Having bindings for them all is a real overkill. And it all is in userspace again. So I generally have choice of CEPH, Lustre, GPFS and PVFS. CEPH is still very alpha, so I can't rely on it, although I keep my eye on it. GPFS is not an option because it is not free and produced by IBM (can't say which of these two is more important ;) ) Can't remember why exactly PVFS is a no-go, their site is down right now. Probably userspace server implementation (although some examples like nfs server discredit idea of in-kernel servers, I still believe this is a way to go). Lustre is widely deployed, predictable and stable. It fully runs in kernel space. Although Oracle did its best to bury Lustre development, it is actively developed by whamcloud and company. They have builds for EL6, so I'm pretty happy with this. Lustre doesn't have any replication built-in so I need to add it on a lower layer (no rsync, no rsync, no rsync ;) ). DRBD suits my needs for a simple HA. But I also need datacenter-level HA, that's why I evaluate stacked DRBD and tickets with booth. So, frankly speaking, I decided to go with Lustre not because it is so cool (it has many-many niceties), but because all others I know do not suit my needs at all due to various reasons. Hope this clarifies my point, Best, Vladislav ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] CIB not saved
On 29/03/2012 10:12, Rasto Levrinc wrote: On Thu, Mar 29, 2012 at 9:54 AM, Fiorenza Meini fme...@esseweb.eu wrote: Hi there, a strange thing happened to my two-node cluster: I rebooted both machines at the same time, and when the OS came up again, no resources were configured anymore, as if it were a fresh installation. Why? It was explained to me that the configuration of resources managed by pacemaker should be in a file called cib.xml, but I cannot find it on the system. Do I have to specify any particular option in the configuration file? Normally you shouldn't worry about it. cib.xml is stored in /var/lib/heartbeat/crm/ or similar and the directory should have hacluster:haclient permissions. What distro is it and how did you install it? Rasto Thanks, it was a permissions problem. Regards -- Fiorenza Meini Spazio Web S.r.l. V. Dante Alighieri, 10 - 13900 Biella Tel.: 015.2431982 - 015.9526066 Fax: 015.2522600 Reg. Imprese, CF e P.I.: 02414430021 Iscr. REA: BI - 188936 Iscr. CCIAA: Biella - 188936 Cap. Soc.: 30.000,00 Euro i.v. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
[Pacemaker] Pacemaker + Oracle
Hi, I'm working with a Pacemaker active/passive cluster and need to use Oracle as a resource in Pacemaker. My resource command is crm configure primitive Oracle ocf:heartbeat:oracle params sid=OracleDB op monitor interval=120s but it did not work for me. Can someone help out on this matter? Regards, Ruwan ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Pacemaker + Oracle
cat /etc/oratab And maybe you can post your log :-) Il giorno 29 marzo 2012 13:53, Ruwan Fernando ruwanm...@gmail.com ha scritto: Hi, I'm working with Pacemaker Active Passive Cluster and need to use oracle as a resource to the pacemaker. my resource script is crm configureprimitive Oracle ocf:heartbeat:oracle params sid=OracleDB op monitor inetrval=120s but it is not worked for me. Can someone help out on this matter? Regards, Ruwan ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org -- esta es mi vida e me la vivo hasta que dios quiera ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
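The request for /etc/oratab above is because the ocf:heartbeat:oracle agent looks the SID up there to find ORACLE_HOME when no home parameter is given. A fuller sketch of the primitive (the home path and user below are placeholder values, not taken from this thread, and the timeouts are only indicative):

primitive p_oracle ocf:heartbeat:oracle \
    params sid="OracleDB" home="/u01/app/oracle/product/11.2.0/dbhome_1" user="oracle" \
    op start interval="0" timeout="120s" \
    op stop interval="0" timeout="120s" \
    op monitor interval="120s" timeout="60s"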
Re: [Pacemaker] Issue with ordering
On Thu, Mar 29, 2012 at 11:40 AM, Vladislav Bogdanov bub...@hoster-ok.com wrote: Hi Florian, 29.03.2012 11:54, Florian Haas wrote: On Thu, Mar 29, 2012 at 10:07 AM, Vladislav Bogdanov bub...@hoster-ok.com wrote: Hi Andrew, all, I'm continuing experiments with lustre on stacked drbd, and see following problem: At the risk of going off topic, can you explain *why* you want to do this? If you need a distributed, replicated filesystem with asynchronous replication capability (the latter presumably for DR), why not use a Distributed-Replicated GlusterFS volume with geo-replication? I need fast POSIX fs scalable to tens of petabytes with support for fallocate() and friends to prevent fragmentation. I generally agree with Linus about FUSE and userspace filesystems in general, so that is not an option. I generally agree with Linus and just about everyone else that filesystems shouldn't require invasive core kernel patches. But I digress. :) Using any API except what VFS provides via syscalls+glibc is not an option too because I need access to files from various scripted languages including shell and directly from a web server written in C. Having bindings for them all is a real overkill. And it all is in userspace again. So I generally have choice of CEPH, Lustre, GPFS and PVFS. CEPH is still very alpha, so I can't rely on it, although I keep my eye on it. GPFS is not an option because it is not free and produced by IBM (can't say which of these two is more important ;) ) Can't remember why exactly PVFS is a no-go, their site is down right now. Probably userspace server implementation (although some examples like nfs server discredit idea of in-kernel servers, I still believe this is a way to go). Ceph is 100% userspace server side, jftr. :) And it has no async replication capability at this point, which you seem to be after. Lustre is widely deployed, predictable and stable. It fully runs in kernel space. Although Oracle did its best to bury Lustre development, it is actively developed by whamcloud and company. They have builds for EL6, so I'm pretty happy with this. Lustre doesn't have any replication built-in so I need to add it on a lower layer (no rsync, no rsync, no rsync ;) ). DRBD suits my needs for a simple HA. But I also need datacenter-level HA, that's why I evaluate stacked DRBD and tickets with booth. So, frankly speaking, I decided to go with Lustre not because it is so cool (it has many-many niceties), but because all others I know do not suit my needs at all due to various reasons. Hope this clarifies my point, It does. Doesn't necessarily mean I agree, but the point you're making is fine. Cheers, Florian -- Need help with High Availability? http://www.hastexo.com/now ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] VirtualDomain Shutdown Timeout
Hi Andrew, Thanks, that sounds good. I am using the Ubuntu HA ppa, so I will wait for a 1.1.7 package to become available. Andrew - Original Message - From: Andrew Beekhof and...@beekhof.net To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Thursday, March 29, 2012 1:08:21 AM Subject: Re: [Pacemaker] VirtualDomain Shutdown Timeout On Sun, Mar 25, 2012 at 6:27 AM, Andrew Martin amar...@xes-inc.com wrote: Hello, I have configured a KVM virtual machine primitive using Pacemaker 1.1.6 and Heartbeat 3.0.5 on Ubuntu 10.04 Server using DRBD as the storage device (so there is no shared storage, no live-migration): primitive p_vm ocf:heartbeat:VirtualDomain \ params config=/vmstore/config/vm.xml \ meta allow-migrate=false \ op start interval=0 timeout=180s \ op stop interval=0 timeout=120s \ op monitor interval=10 timeout=30 I would expect the following events to happen on failover on the from node (the migration source) if the VM hangs while shutting down: 1. VirtualDomain issues virsh shutdown vm to gracefully shutdown the VM 2. pacemaker waits 120 seconds for the timeout specified in the op stop timeout 3. VirtualDomain waits a bit less than 120 seconds to see if it will gracefully shutdown. Once it gets to almost 120 seconds, it issues virsh destroy vm to hard stop the VM. 4. pacemaker wakes up from the 120 second timeout and sees that the VM has stopped and proceeds with the failover However, I observed that VirtualDomain seems to be using the timeout from the op start line, 180 seconds, yet pacemaker uses the 120 second timeout. Thus, the VM is still running after the pacemaker timeout is reached and so the node is STONITHed. Here is the relevant section of code from /usr/lib/ocf/resource.d/heartbeat/VirtualDomain: VirtualDomain_Stop() { local i local status local shutdown_timeout local out ex VirtualDomain_Status status=$? case $status in $OCF_SUCCESS) if ! ocf_is_true $OCF_RESKEY_force_stop; then # Issue a graceful shutdown request ocf_log info Issuing graceful shutdown request for domain ${DOMAIN_NAME}. virsh $VIRSH_OPTIONS shutdown ${DOMAIN_NAME} # The shutdown_timeout we use here is the operation # timeout specified in the CIB, minus 5 seconds shutdown_timeout=$(( $NOW + ($OCF_RESKEY_CRM_meta_timeout/1000) -5 )) # Loop on status until we reach $shutdown_timeout while [ $NOW -lt $shutdown_timeout ]; do Doesn't $OCF_RESKEY_CRM_meta_timeout correspond to the timeout value in the op stop ... line? It should, however there was a bug in 1.1.6 where this wasn't the case. The relevant patch is: https://github.com/beekhof/pacemaker/commit/fcfe6fe Or you could try 1.1.7 How can I optimize my pacemaker configuration so that the VM will attempt to gracefully shutdown and then at worst case destroy the VM before the pacemaker timeout is reached? Moreover, is there anything I can do inside of the VM (another Ubuntu 10.04 install) to optimize/speed up the shutdown process? 
Thanks, Andrew ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] OCF_RESKEY_CRM_meta_{ordered,notify,interleave}
On Fri, Mar 30, 2012 at 1:47 AM, Florian Haas flor...@hastexo.com wrote: Lars (lmb), or Andrew -- maybe one of you remembers what this was all about. In this commit, Lars enabled the OCF_RESKEY_CRM_meta_{ordered,notify,interleave} attributes to be injected into the environment of RAs: https://github.com/ClusterLabs/pacemaker/commit/b0ba01f61086f073be69db3e6beb0914642f79d9 Then that change was almost immediately backed out: https://github.com/ClusterLabs/pacemaker/commit/b33d3bf5376ab59baa435086c803b9fdaf6de504 Because it was felt that RAs shouldn't need to know. Those options change pacemaker's behaviour, not the RAs. But subsequently, in lf#2391, you convinced us to add notify since it allowed the drbd agent to error out if they were not turned on. And since then, at some point evidently only interleave and notify made it back in. Any specific reason for omitting ordered? I happen to have a pretty good use case for an ordered-clone RA, and it would be handy to be able to test whether clone ordering has been enabled. I'd need more information. The RA shouldn't need to care I would have thought. The ordering happens in the PE/crmd, the RA should just do what its told. All insights are much appreciated. Cheers, Florian -- Need help with High Availability? http://www.hastexo.com/now ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
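The notify check Florian refers to (lf#2391) follows roughly this pattern inside a resource agent, erroring out early if the clone was not configured with notify=true. A hedged sketch, not the linbit drbd agent's actual code (it assumes ocf-shellfuncs is sourced, which defines ocf_is_true, ocf_log and the OCF_* return codes):

check_notify_enabled() {
    if ! ocf_is_true "${OCF_RESKEY_CRM_meta_notify}"; then
        ocf_log err "This resource agent requires notify=true on the master/clone resource."
        return "$OCF_ERR_CONFIGURED"
    fi
    return "$OCF_SUCCESS"
}

An analogous check against OCF_RESKEY_CRM_meta_ordered is what the ordered-clone RA mentioned above would want, which is exactly why its absence from the environment matters.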
Re: [Pacemaker] CIB not saved
On Thu, Mar 29, 2012 at 8:45 PM, Fiorenza Meini fme...@esseweb.eu wrote: Il 29/03/2012 10:12, Rasto Levrinc ha scritto: On Thu, Mar 29, 2012 at 9:54 AM, Fiorenza Meinifme...@esseweb.eu wrote: Hi there, a strange thing happened to my two node cluster: I rebooted both machine at the same time, when s.o. went up again, no resources were configured anymore: as it was a fresh installation. Why ? It was explained to me that the configuration of resources managed by pacemaker should be in a file called cib.xml, but cannot find it in the system. Have I to specify any particular option in the configuration file? Normally you shouldn't worry about it. cib.xml is stored in /var/lib/heartbeat/crm/ or similar and the directory should have have hacluster:haclient permissions. What distro is it and how did you install it? Rasto Thanks, it was a permission problems. Normally we log an error at startup if we can't write there... did this not happen? ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Issue with ordering
On Thu, Mar 29, 2012 at 7:07 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote:

Hi Andrew, all,

I'm continuing experiments with lustre on stacked drbd, and see the following problem: I have one drbd resource (ms-drbd-testfs-mdt) stacked on top of another (ms-drbd-testfs-mdt-left), with the following constraints between them:

colocation drbd-testfs-mdt-with-drbd-testfs-mdt-left inf: ms-drbd-testfs-mdt ms-drbd-testfs-mdt-left:Master
order drbd-testfs-mdt-after-drbd-testfs-mdt-left inf: ms-drbd-testfs-mdt-left:promote ms-drbd-testfs-mdt:start

Then I have a filesystem mounted on top of ms-drbd-testfs-mdt (the testfs-mdt resource).

colocation testfs-mdt-with-drbd-testfs-mdt inf: testfs-mdt ms-drbd-testfs-mdt:Master
order testfs-mdt-after-drbd-testfs-mdt inf: ms-drbd-testfs-mdt:promote testfs-mdt:start

When I trigger an event which causes many resources to stop (including these three), the LogActions output looks like:

LogActions: Stop drbd-local#011(lustre01-left)
LogActions: Stop drbd-stacked#011(Started lustre02-left)
LogActions: Stop drbd-testfs-local#011(Started lustre03-left)
LogActions: Stop drbd-testfs-stacked#011(Started lustre04-left)
LogActions: Stop lustre#011(Started lustre04-left)
LogActions: Stop mgs#011(Started lustre01-left)
LogActions: Stop testfs#011(Started lustre03-left)
LogActions: Stop testfs-mdt#011(Started lustre01-left)
LogActions: Stop testfs-ost#011(Started lustre01-left)
LogActions: Stop testfs-ost0001#011(Started lustre02-left)
LogActions: Stop testfs-ost0002#011(Started lustre03-left)
LogActions: Stop testfs-ost0003#011(Started lustre04-left)
LogActions: Stop drbd-mgs:0#011(Master lustre01-left)
LogActions: Stop drbd-mgs:1#011(Slave lustre02-left)
LogActions: Stop drbd-testfs-mdt:0#011(Master lustre01-left)
LogActions: Stop drbd-testfs-mdt-left:0#011(Master lustre01-left)
LogActions: Stop drbd-testfs-mdt-left:1#011(Slave lustre02-left)
LogActions: Stop drbd-testfs-ost:0#011(Master lustre01-left)
LogActions: Stop drbd-testfs-ost-left:0#011(Master lustre01-left)
LogActions: Stop drbd-testfs-ost-left:1#011(Slave lustre02-left)
LogActions: Stop drbd-testfs-ost0001:0#011(Master lustre02-left)
LogActions: Stop drbd-testfs-ost0001-left:0#011(Master lustre02-left)
LogActions: Stop drbd-testfs-ost0001-left:1#011(Slave lustre01-left)
LogActions: Stop drbd-testfs-ost0002:0#011(Master lustre03-left)
LogActions: Stop drbd-testfs-ost0002-left:0#011(Master lustre03-left)
LogActions: Stop drbd-testfs-ost0002-left:1#011(Slave lustre04-left)
LogActions: Stop drbd-testfs-ost0003:0#011(Master lustre04-left)
LogActions: Stop drbd-testfs-ost0003-left:0#011(Master lustre04-left)
LogActions: Stop drbd-testfs-ost0003-left:1#011(Slave lustre03-left)

For some reason demote is not run on both mdt drbd resources (should it be?), so the drbd RA prints a warning about that.

So it's not just a logging error, the demote really isn't scheduled? That would be bad, can you file a bug please?

What I see then is that ms-drbd-testfs-mdt-left is being stopped before ms-drbd-testfs-mdt. Moreover, the testfs-mdt filesystem resource is not stopped before drbd-testfs-mdt is stopped. I have advisory ordering constraints between the mdt and ost filesystem resources, so all osts are stopped before the mdt; thus the mdt stop is delayed a bit. Maybe this influences what happens. I'm pretty sure I have correct constraints for at least these three resources, so it looks like a bug, because the mandatory ordering is not preserved. I can produce a report for this.
Best, Vladislav
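For readers following the constraint discussion, here is the difference between mandatory and advisory ordering expressed as crm shell commands. The first constraint restates the mandatory one quoted from the configuration above; the advisory one between the mdt and ost filesystems is only a guess at the shape of the constraint being described, with the direction assumed from "all ost's are stopped before mdt".

    # Mandatory ordering (score inf): pacemaker must run the actions in this
    # order, and reverses it for stop/demote in the same way.
    crm configure order testfs-mdt-after-drbd-testfs-mdt inf: ms-drbd-testfs-mdt:promote testfs-mdt:start

    # Advisory ordering (score 0): honoured only when both actions happen in
    # the same transition, never enforced on its own. Resource names and
    # direction here are assumptions for illustration.
    crm configure order testfs-mdt-before-testfs-ost 0: testfs-mdt testfs-ost

The reported problem is that the mandatory ordering above should force testfs-mdt to stop before drbd-testfs-mdt is stopped or demoted, regardless of how the advisory constraint delays the mdt stop.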
Re: [Pacemaker] [Patch]Patch for crmd-transition-delay processing.
Hi Andrew,

Thank you for the comment.

The patch makes sense, could you resend as a github pull request? :-D

All right!! I will send it once it is ready. Please wait. Best Regards, Hideo Yamauchi.

--- On Thu, 2012/3/29, Andrew Beekhof and...@beekhof.net wrote:

The patch makes sense, could you resend as a github pull request? :-D

On Thu, Mar 22, 2012 at 8:18 PM, renayama19661...@ybb.ne.jp wrote:

Hi All, Sorry, my patch was wrong. I am sending a corrected patch. Best Regards, Hideo Yamauchi.

--- On Thu, 2012/3/22, renayama19661...@ybb.ne.jp renayama19661...@ybb.ne.jp wrote:

Hi All,

crmd-transition-delay is supposed to wait for late attribute updates. However, crmd cannot wait for the attributes properly, because the timer is not reset when an attribute update is delayed after the timer has already been set. As a result, resources may not be placed correctly.

I wrote a patch for Pacemaker 1.0.12. The patch blocks tengine's processing while a crmd-transition-delay timer is set, so tengine only acts on pengine's instructions after the crmd-transition-delay timer has definitely expired. With this patch the start of resources may be delayed, but they end up placed correctly according to the constraints.

* I think a similar correction is necessary for the development version of Pacemaker.

Best Regards, Hideo Yamauchi.
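For anyone who has not sent a github pull request before, a rough outline of turning a posted patch into one follows. The branch name, fork remote, and patch filename are placeholders, and the exact steps depend on how your github fork is set up.

    # Clone the upstream repository and create a topic branch:
    git clone https://github.com/ClusterLabs/pacemaker.git
    cd pacemaker
    git checkout -b crmd-transition-delay

    # Apply the patch as a commit. git am keeps the author information if
    # the patch was produced with git format-patch; otherwise apply it and
    # commit by hand.
    git am ../crmd-transition-delay.patch

    # Push the branch to your own fork and open the pull request on github:
    git remote add myfork git@github.com:<your-user>/pacemaker.git
    git push myfork crmd-transition-delay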
Re: [Pacemaker] [Problem] The cluster fails in the stop of the node.
Hi Andrew,

This appears to be resolved with 1.1.7, perhaps look for a patch to backport?

I will verify the behaviour with Pacemaker 1.1.7, and I will discuss the backport with Mr. Mori. Best Regards, Hideo Yamauchi.

--- On Thu, 2012/3/29, Andrew Beekhof and...@beekhof.net wrote:

This appears to be resolved with 1.1.7, perhaps look for a patch to backport?

On Tue, Mar 27, 2012 at 4:46 PM, renayama19661...@ybb.ne.jp wrote:

Hi All,

When we set a group resource within a Master/Slave resource, we found a problem where a node could not stop. This problem occurs in Pacemaker 1.0.11. We confirmed the problem with the following procedure.

Step 1) Start all nodes.

Last updated: Tue Mar 27 14:35:16 2012
Stack: Heartbeat
Current DC: test2 (b645c456-af78-429e-a40a-279ed063b97d) - partition WITHOUT quorum
Version: 1.0.12-unknown
2 Nodes configured, unknown expected votes
4 Resources configured.

Online: [ test1 test2 ]

Master/Slave Set: msGroup01
    Masters: [ test1 ]
    Slaves: [ test2 ]
Resource Group: testGroup
    prmDummy1 (ocf::pacemaker:Dummy): Started test1
    prmDummy2 (ocf::pacemaker:Dummy): Started test1
Resource Group: grpStonith1
    prmStonithN1 (stonith:external/ssh): Started test2
Resource Group: grpStonith2
    prmStonithN2 (stonith:external/ssh): Started test1

Migration summary:
* Node test2:
* Node test1:

Step 2) Stop the Slave node.

[root@test2 ~]# service heartbeat stop
Stopping High-Availability services: Done.

Step 3) Stop the Master node. However, the Master node goes into a loop and does not stop.

(snip)
Mar 27 14:38:06 test1 crmd: [21443]: WARN: run_graph: Transition 3 (Complete=7, Pending=0, Fired=0, Skipped=0, Incomplete=23, Source=/var/lib/pengine/pe-input-3.bz2): Terminated
Mar 27 14:38:06 test1 crmd: [21443]: ERROR: te_graph_trigger: Transition failed: terminated
Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_graph: Graph 3 (30 actions in 30 synapses): batch-limit=30 jobs, network-delay=6ms
Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_graph: Synapse 0 is pending (priority: 0)
Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_elem: [Action 12]: Pending (id: testMsGroup01:0_stop_0, type: pseduo, priority: 0)
Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_elem: * [Input 14]: Completed (id: testMsGroup01:0_demote_0, type: pseduo, priority: 0)
Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_elem: * [Input 32]: Pending (id: msGroup01_stop_0, type: pseduo, priority: 0)
Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_graph: Synapse 1 is pending (priority: 0)
Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_elem: [Action 13]: Pending (id: testMsGroup01:0_stopped_0, type: pseduo, priority: 0)
Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_elem: * [Input 8]: Pending (id: prmStateful1:0_stop_0, loc: test1, priority: 0)
Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_elem: * [Input 9]: Pending (id: prmStateful2:0_stop_0, loc: test1, priority: 0)
Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_elem: * [Input 12]: Pending (id: testMsGroup01:0_stop_0, type: pseduo, priority: 0)
Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_graph: Synapse 2 was confirmed (priority: 0)
(snip)

I attach the hb_report data.

Best Regards, Hideo Yamauchi.
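For context, a hypothetical reconstruction of the "group within Master/Slave" layout being described, pieced together from the resource names in the status and log output above. The agent choice, operations, and meta attributes are assumptions, not the poster's actual configuration.

    # Two stateful primitives grouped together, then wrapped in a
    # master/slave resource -- the shape that triggers the stop hang.
    crm configure primitive prmStateful1 ocf:pacemaker:Stateful \
        op monitor interval=10s role=Master op monitor interval=20s role=Slave
    crm configure primitive prmStateful2 ocf:pacemaker:Stateful \
        op monitor interval=10s role=Master op monitor interval=20s role=Slave
    crm configure group testMsGroup01 prmStateful1 prmStateful2
    crm configure ms msGroup01 testMsGroup01 \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true

With a layout like this on 1.0.11, the transition graph above stalls because the group stop actions inside the master/slave instance never become runnable, which matches the pending synapses in the logs.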
[Pacemaker] Nodes not rejoining cluster
I had a circuit breaker go out and take two of the 5 nodes in my cluster down. Now that they're back up and running, they are not rejoining the cluster. Here is what I get from crm_mon -1.

Nodes 1, 2 and 3 (itchy, scratchy and walter) show the following:

Last updated: Thu Mar 29 19:04:05 2012
Last change: Thu Mar 29 19:04:03 2012 via cibadmin on walter
Stack: openais
Current DC: walter - partition with quorum
Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
5 Nodes configured, 5 expected votes
9 Resources configured.
Online: [ itchy scratchy walter butthead timmy ]

On butthead I get:

Last updated: Thu Mar 29 19:04:24 2012
Last change: Thu Mar 29 18:42:09 2012 via cibadmin on itchy
Stack: openais
Current DC: NONE
5 Nodes configured, 5 expected votes
9 Resources configured.
OFFLINE: [ itchy scratchy walter butthead timmy ]

On timmy, I get:

Last updated: Thu Mar 29 19:04:20 2012
Last change:
Current DC: NONE
0 Nodes configured, unknown expected votes
0 Resources configured.

I don't have anything important running yet, so I can do a full clean-up of everything if needed.

I also get some weird behavior with timmy. I brought this node up with the host name timmy.example.com and then changed the host name to timmy, but when the cluster is offline timmy.example.com shows up as offline. I enter crm node delete timmy.example.com and it goes away until timmy goes offline again.

Thanks, Gregg Stock
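On the timmy naming point, it can help to compare what the operating system and the cluster each think the node is called; if they disagree, a stale node entry tends to keep reappearing. The commands below are a sketch of that check, and the exact output and available subcommands vary with the pacemaker and crm shell versions installed.

    # What the OS reports; corosync and heartbeat normally derive the node
    # name from this unless configured otherwise:
    uname -n

    # Nodes in the current cluster partition, as pacemaker sees them:
    crm_node -p

    # Node entries stored in the CIB (the stale timmy.example.com entry
    # would show up here):
    crm node show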
Re: [Pacemaker] Nodes not rejoining cluster
Gotta have logs. From all 3 nodes mentioned. Only then can we determine if the problem is at the corosync or pacemaker layer - which is the prerequisite for figuring out what to do next :)

On Fri, Mar 30, 2012 at 1:30 PM, Gregg Stock gr...@damagecontrolusa.com wrote:

I had a circuit breaker go out and take two of the 5 nodes in my cluster down. Now that they're back up and running, they are not rejoining the cluster. Here is what I get from crm_mon -1.

Nodes 1, 2 and 3 (itchy, scratchy and walter) show the following:

Last updated: Thu Mar 29 19:04:05 2012
Last change: Thu Mar 29 19:04:03 2012 via cibadmin on walter
Stack: openais
Current DC: walter - partition with quorum
Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
5 Nodes configured, 5 expected votes
9 Resources configured.
Online: [ itchy scratchy walter butthead timmy ]

On butthead I get:

Last updated: Thu Mar 29 19:04:24 2012
Last change: Thu Mar 29 18:42:09 2012 via cibadmin on itchy
Stack: openais
Current DC: NONE
5 Nodes configured, 5 expected votes
9 Resources configured.
OFFLINE: [ itchy scratchy walter butthead timmy ]

On timmy, I get:

Last updated: Thu Mar 29 19:04:20 2012
Last change:
Current DC: NONE
0 Nodes configured, unknown expected votes
0 Resources configured.

I don't have anything important running yet, so I can do a full clean-up of everything if needed.

I also get some weird behavior with timmy. I brought this node up with the host name timmy.example.com and then changed the host name to timmy, but when the cluster is offline timmy.example.com shows up as offline. I enter crm node delete timmy.example.com and it goes away until timmy goes offline again.

Thanks, Gregg Stock
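One way to gather the logs being asked for in a single archive is sketched below. The time window, node list, and destination name are examples only; crm_report ships with pacemaker 1.1.x, and hb_report (from cluster-glue) offers a similar interface on Heartbeat-based installs.

    # Collect corosync and pacemaker logs from the affected nodes around
    # the time of the outage into one tarball that can be attached here:
    crm_report -f "2012-03-29 18:00" -t "2012-03-29 20:00" \
        -n "walter butthead timmy" /tmp/rejoin-issue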