[Pacemaker] HA across WDM Fibre link - Nodes won't rejoin after reboot
Hi everyone. I have 2 nodes running on ESX hosts in 2 geographically diverse data centres. The link between them is a DWDM fibre link, which is the only thing I can think of as being the cause of this. SLES 11 SP1 with HAE, all latest updates.

If Corosync is set to multicast on the default address, there are no comms between Corosync on the nodes. If I use broadcast, it will communicate and let the nodes join. If I reboot node 2, it rejoins fine. If I reboot node 1, it enters a pending phase for a while then just drops to offline. I can then clear the config out again and let the nodes rejoin. Node 1 always seems to be the DC.

Pending - logs from node 1, loops this every second:

-02: id=336371722 state=member (new) addr=r(0) ip(10.160.12.20) votes=1 born=7912 seen=7920 proc=00151312
Apr 2 14:37:13 PPS-VMAIL-01 crmd: [3896]: info: crm_update_peer: Node PPS-VMAIL-01: id=168599562 state=member (new) addr=r(0) ip(10.160.12.10) (new) votes=1 (new) born=7920 seen=7920 proc=00151312 (new)
Apr 2 14:37:13 PPS-VMAIL-01 crmd: [3896]: WARN: do_log: FSA: Input I_SHUTDOWN from revision_check_callback() received in state S_STARTING
Apr 2 14:37:13 PPS-VMAIL-01 crmd: [3896]: info: do_state_transition: State transition S_STARTING -> S_STOPPING [ input=I_SHUTDOWN cause=C_FSA_INTERNAL origin=revision_check_callback ]
Apr 2 14:37:13 PPS-VMAIL-01 crmd: [3896]: info: do_lrm_control: Disconnected from the LRM
Apr 2 14:37:13 PPS-VMAIL-01 crmd: [3896]: info: do_ha_control: Disconnected from OpenAIS
Apr 2 14:37:13 PPS-VMAIL-01 crmd: [3896]: info: do_cib_control: Disconnecting CIB
Apr 2 14:37:13 PPS-VMAIL-01 crmd: [3896]: info: do_exit: Performing A_EXIT_0 - gracefully exiting the CRMd
Apr 2 14:37:13 PPS-VMAIL-01 crmd: [3896]: info: free_mem: Dropping I_NULL: [ state=S_STOPPING cause=C_FSA_INTERNAL origin=register_fsa_error_adv ]
Apr 2 14:37:13 PPS-VMAIL-01 crmd: [3896]: info: free_mem: Dropping I_TERMINATE: [ state=S_STOPPING cause=C_FSA_INTERNAL origin=do_stop ]
Apr 2 14:37:13 PPS-VMAIL-01 crmd: [3896]: info: do_exit: [crmd] stopped (0)

Offline - logs from node 1, loops every second:

Apr 2 14:38:06 PPS-VMAIL-01 cib: [3510]: info: cib_replace_notify: Local-only Replace: 0.0.0 from PP2-VMAIL-02
Apr 2 14:38:06 PPS-VMAIL-01 attrd: [3512]: info: do_cib_replaced: Sending full refresh
Apr 2 14:38:06 PPS-VMAIL-01 attrd: [3512]: info: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (null)
Apr 2 14:38:06 PPS-VMAIL-01 cib: [3510]: info: apply_xml_diff: Digest mis-match: expected 0cf389141d344ca552679f9924d281c5, calculated 818a100a0e3b725068393624381c9d4f
Apr 2 14:38:06 PPS-VMAIL-01 cib: [3510]: notice: cib_process_diff: Diff 0.13.642 -> 0.0.0 not applied to 0.13.642: Failed application of an update diff
Apr 2 14:38:06 PPS-VMAIL-01 cib: [3510]: info: cib_server_process_diff: Requesting re-sync from peer
Apr 2 14:38:06 PPS-VMAIL-01 cib: [3510]: WARN: cib_diff_notify: Local-only Change (client:attrd, call: 1221): 0.0.0 (Application of an update diff failed, requesting a full refresh)

Offline - logs from node 2, loops every second:

Apr 2 14:39:05 PP2-VMAIL-02 corosync[3794]: [TOTEM ] Retransmit List: 29b7 29b8 29b9
Apr 2 14:39:05 PP2-VMAIL-02 corosync[3794]: [TOTEM ] Retransmit List: 29bb 29bc
Apr 2 14:39:05 PP2-VMAIL-02 cib: [3801]: info: cib_process_request: Operation complete: op cib_sync_one for section 'all' (origin=PPS-VMAIL-01/PPS-VMAIL-01/(null), version=0.13.1538): ok (rc=0)

Any ideas please? Thanks.
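For reference, this is how multicast reachability across the link can be checked independently of the cluster (a sketch; assumes omping is installed on both nodes, and the group/port should match whatever corosync.conf uses):

# run simultaneously on both nodes; each should report unicast
# and multicast replies from the other
omping -m 226.94.1.1 -p 5405 PPS-VMAIL-01 PP2-VMAIL-02

If the DWDM equipment filters multicast, corosync can be told to use broadcast explicitly instead of relying on a multicast group, roughly like this in /etc/corosync/corosync.conf:

totem {
	interface {
		ringnumber: 0
		bindnetaddr: 10.160.12.0
		broadcast: yes
		mcastport: 5405
	}
}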
Darren Mansell

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] HA across WDM Fibre link - Nodes won't rejoin after reboot
On 2012-04-02T14:53:53, darren.mans...@opengi.co.uk wrote:

> > I have 2 nodes running on ESX hosts in 2 geographically diverse data centres. The link between them is a DWDM fibre link which is the only thing I can think of as being the cause of this. SLES 11 SP1 with HAE. All latest updates.
>
> That looks timing related; what bandwidth/latency do you get between the two sites?

No theoretical bandwidth quoted from the supplier, but in the hundreds of Mbits I believe. 670ns latency.

> You know that geographical clusters are not an officially supported deployment scenario for SP1, right? ;-) That's documented in the release notes and changing only with SP2.
>
> Regards, Lars

I think the bandwidth and latency are sufficient to happily accept this as local :) But no, I didn't know that, unfortunately. If I can't get Corosync communications working I'll just have to drop the HA and use external load-balancers to direct traffic. Thanks.

Darren
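P.S. If it does turn out to be timing, the totem timers in /etc/corosync/corosync.conf are the obvious place to experiment. A sketch with illustrative values only; the defaults assume LAN latency:

totem {
	# ms to wait for the token before declaring token loss
	token: 10000
	token_retransmits_before_loss_const: 20
	# must be greater than token
	consensus: 12000
	join: 1000
}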
[Pacemaker] Dual-Primary DRBD with OCFS2 on SLES 11 SP1
(Originally sent to DRBD-user, reposted here as it may be more relevant)

Hello all. I'm implementing a 2-node cluster using Corosync/Pacemaker/DRBD/OCFS2 for a dual-primary shared FS. I've followed the instructions on the DRBD applications site and it works really well. However, if I 'pull the plug' on a node, the other node continues to operate the clones, but the filesystem is locked and inaccessible (the monitor op works for the filesystem, but fails for the OCFS2 resource). If I reboot one node, there are no problems and I can continue to access the OCFS2 FS.

After I pull the plug:

Online: [ test-odp-02 ]
OFFLINE: [ test-odp-01 ]
 Resource Group: Load-Balancing
     Virtual-IP-ODP (ocf::heartbeat:IPaddr2): Started test-odp-02
     Virtual-IP-ODPWS (ocf::heartbeat:IPaddr2): Started test-odp-02
     ldirectord (ocf::heartbeat:ldirectord): Started test-odp-02
 Master/Slave Set: ms_drbd_ocfs2 [p_drbd_ocfs2]
     Masters: [ test-odp-02 ]
     Stopped: [ p_drbd_ocfs2:1 ]
 Clone Set: cl-odp [odp]
     Started: [ test-odp-02 ]
     Stopped: [ odp:1 ]
 Clone Set: cl-odpws [odpws]
     Started: [ test-odp-02 ]
     Stopped: [ odpws:1 ]
 Clone Set: cl_fs_ocfs2 [p_fs_ocfs2]
     Started: [ test-odp-02 ]
     Stopped: [ p_fs_ocfs2:1 ]
 Clone Set: cl_ocfs2mgmt [g_ocfs2mgmt]
     Started: [ test-odp-02 ]
     Stopped: [ g_ocfs2mgmt:1 ]

Failed actions:
    p_o2cb:0_monitor_1 (node=test-odp-02, call=19, rc=-2, status=Timed Out): unknown exec error

test-odp-02:~ # mount
/dev/drbd0 on /opt/odp type ocfs2 (rw,_netdev,noatime,cluster_stack=pcmk)
test-odp-02:~ # ls /opt/odp
...just hangs forever...

If I then power test-odp-01 back on, everything fails back fine and the ls command suddenly completes. It seems to me that OCFS2 is trying to talk to the node that has disappeared and doesn't time out. Does anyone have any ideas? (Attached CRM and DRBD configs.) Many thanks.
Darren Mansell

[attachment: drbd.conf]

node test-odp-01
node test-odp-02 \
	attributes standby=off
primitive Virtual-IP-ODP ocf:heartbeat:IPaddr2 \
	params lvs_support=true ip=2.21.15.100 cidr_netmask=8 broadcast=2.255.255.255 \
	op monitor interval=1m timeout=10s \
	meta migration-threshold=10 failure-timeout=600
primitive Virtual-IP-ODPWS ocf:heartbeat:IPaddr2 \
	params lvs_support=true ip=2.21.15.103 cidr_netmask=8 broadcast=2.255.255.255 \
	op monitor interval=1m timeout=10s \
	meta migration-threshold=10 failure-timeout=600
primitive ldirectord ocf:heartbeat:ldirectord \
	params configfile=/etc/ha.d/ldirectord.cf \
	op monitor interval=2m timeout=20s \
	meta migration-threshold=10 failure-timeout=600
primitive odp lsb:odp \
	op monitor interval=10s enabled=true timeout=10s \
	meta migration-threshold=10 failure-timeout=600
primitive odpwebservice lsb:odpws \
	op monitor interval=10s enabled=true timeout=10s \
	meta migration-threshold=10 failure-timeout=600
primitive p_controld ocf:pacemaker:controld \
	op monitor interval=10s enabled=true timeout=10s \
	meta migration-threshold=10 failure-timeout=600
primitive p_drbd_ocfs2 ocf:linbit:drbd \
	params drbd_resource=r0 \
	op monitor interval=10s enabled=true timeout=10s \
	meta migration-threshold=10 failure-timeout=600
primitive p_fs_ocfs2 ocf:heartbeat:Filesystem \
	params device=/dev/drbd/by-res/r0 directory=/opt/odp fstype=ocfs2 options=rw,noatime \
	op monitor interval=10s enabled=true timeout=10s \
	meta migration-threshold=10 failure-timeout=600
primitive p_o2cb ocf:ocfs2:o2cb \
	op monitor interval=10s enabled=true timeout=10s \
	meta migration-threshold=10 failure-timeout=600
group Load-Balancing Virtual-IP-ODP Virtual-IP-ODPWS ldirectord
group g_ocfs2mgmt p_controld p_o2cb
ms ms_drbd_ocfs2 p_drbd_ocfs2 \
	meta master-max=2 clone-max=2 notify=true
clone cl-odp odp
clone cl-odpws odpws
clone cl_fs_ocfs2 p_fs_ocfs2 \
	meta target-role=Started
clone cl_ocfs2mgmt g_ocfs2mgmt \
	meta interleave=true
location Prefer-Node1 ldirectord \
	rule $id=prefer-node1-rule 100: #uname eq test-odp-01
order o_ocfs2 inf: ms_drbd_ocfs2:promote cl_ocfs2mgmt:start cl_fs_ocfs2:start
order tomcatlast1 inf: cl_fs_ocfs2 cl-odp
order tomcatlast2 inf: cl_fs_ocfs2 cl-odpws
property $id=cib-bootstrap-options \
	dc-version=1.1.5-5bd2b9154d7d9f86d7f56fe0a74072a5a6590c60 \
	cluster-infrastructure=openais \
	expected-quorum-votes=2 \
	no-quorum-policy=ignore \
	start-failure-is-fatal=false \
	stonith-action=reboot \
	stonith-enabled=false \
	last-lrm-refresh=1317207361
Re: [Pacemaker] Dual-Primary DRBD with OCFS2 on SLES 11 SP1
Sorry for top-posting, I'm Outlook-afflicted.

This is also my problem. In the full production environment there will be low-level hardware fencing by means of IBM RSA/ASM, but this is a VMware test environment. The VMware STONITH plugin is dated and doesn't seem to work correctly (I gave up quickly after the author of the plugin stated on this list that it probably won't work), and SSH STONITH seems to have been removed, not that it would do much good in this circumstance. So there's no way to set up STONITH in a VMware test environment, which is where I believe a lot of people architect solutions these days, and therefore no way to prove a solution works.

I'll attempt to modify and improve the VMware STONITH agent, but I'm not sure how STONITH could help in this situation, where a node has gone away and left a single remaining node, and the remaining node is then failing. Is this where the suicide agent comes in?

Regards, Darren

-----Original Message-----
From: Nick Khamis [mailto:sym...@gmail.com]
Sent: 29 September 2011 15:48
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Dual-Primary DRBD with OCFS2 on SLES 11 SP1

Hello Dejan,

Sorry to hijack; I am also working on the same type of setup as a prototype. What is the best way to get stonith included for VM setups? Maybe an SSH stonith? Again, this is just for the prototype.

Cheers, Nick.

On Thu, Sep 29, 2011 at 9:28 AM, Dejan Muhamedagic deja...@fastmail.fm wrote:

Hi Darren,

On Thu, Sep 29, 2011 at 02:15:34PM +0100, darren.mans...@opengi.co.uk wrote:
> (Originally sent to DRBD-user, reposted here as it may be more relevant)
> Hello all. I'm implementing a 2-node cluster using Corosync/Pacemaker/DRBD/OCFS2 for dual-primary shared FS. [...]

With stonith disabled, I doubt that your cluster can behave as it should.

Thanks, Dejan
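P.S. One fencing option that reportedly does work inside VMware on SLES 11 HAE is disk-based STONITH with sbd on a small shared virtual disk, rather than a power/management plugin. A rough sketch (device path hypothetical; verify the parameter name with stonith -t external/sbd -n):

# once, from one node, on a dedicated small shared disk
sbd -d /dev/sdc create
# on every node: load a watchdog and point the init scripts at the disk
modprobe softdog
# /etc/sysconfig/sbd:  SBD_DEVICE="/dev/sdc"

# then in the CIB:
primitive stonith-sbd stonith:external/sbd \
	params sbd_device="/dev/sdc"
property stonith-enabled=true

It doesn't answer the "last node standing" question, but it does give a test cluster a usable fencing device.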
Re: [Pacemaker] Help With Cluster Failure
-----Original Message-----
From: Andrew Beekhof [mailto:and...@beekhof.net]
Sent: 08 April 2011 08:15
To: The Pacemaker cluster resource manager
Cc: Darren Mansell
Subject: Re: [Pacemaker] Help With Cluster Failure

On Thu, Apr 7, 2011 at 12:12 PM, darren.mans...@opengi.co.uk wrote:
> Hi all. One of my clusters had a STONITH shoot-out last night and then refused to do anything but sit there from 0400 until 0735 after I'd been woken up to fix it. In the end, just a resource cleanup fixed it, which I don't think should be the case. I have an 8MB hb_report file. Is that too big to attach to send here? Should I upload it somewhere?

Is there somewhere you can put it and send us a URL?

Absolutely. Thanks Andrew.

www.mysqlsimplecluster.com/HB_report/DM_report_1.tar.bz2

Darren
[Pacemaker] Help With Cluster Failure
Hi all. One of my clusters had a STONITH shoot-out last night and then refused to do anything but sit there from 0400 until 0735 after I'd been woken up to fix it. In the end, just a resource cleanup fixed it, which I don't think should be the case. I have an 8MB hb_report file. Is that too big to attach to send here? Should I upload it somewhere? Thanks.

Darren Mansell
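P.S. For anyone wondering, the report was generated along these lines (a sketch; timestamps illustrative):

hb_report -f "2011/04/07 03:30" -t "2011/04/07 08:00" /tmp/DM_report_1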
Re: [Pacemaker] IPaddr2 Netmask Bug Fix Issue
From: Pavel Levshin [mailto:pa...@levshin.spb.ru]
Sent: 25 March 2011 19:50
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] IPaddr2 Netmask Bug Fix Issue

25.03.2011 18:47, darren.mans...@opengi.co.uk:
> We configure a virtual IP on the non-arping lo interface of both servers and then configure the IPaddr2 resource with lvs_support=true. This RA will remove the duplicate IP from the lo interface when it becomes active. Grouping the VIP with ldirectord/LVS, we can have the load-balancer and VIP on one node, balancing traffic to the other node, with failover where both resources fail over together.
> To do this we need to configure the VIP on lo with a 32-bit netmask, but the VIP on the eth0 interface needs to have a 24-bit netmask. This has worked fine up until now and we base all of our clusters on this method. Now what happens is that the find_interface() routine in IPaddr2 doesn't remove the IP from lo when starting the VIP resource, as it can't find it due to the netmask not matching.

Do you really need the address to be deleted from lo? Having two identical addresses on the Linux machine should not harm, if routing is not affected. In your case, with a /32 netmask on lo, I do not foresee any problems. We use it in this way, i.e. with the address set on lo permanently.

--
Pavel Levshin

Thanks Pavel. However, this means I would have to disable LVS support for the resource. Which means that to make it work with LVS, I have to set lvs_support to false. Of course, I'll do whatever it takes on my setup to make it work, but it's not intuitive for other users.

Regards
Darren Mansell
Re: [Pacemaker] Lots of Issues with Live Pacemaker Cluster
On Mon, 2011-03-14 at 12:19 +0100, Andrew Beekhof wrote:
> > 1. DRBD doesn't promote/demote correctly. Whenever I have a failover,
> > the DRBD resource will just sit there on the wrong node, holding up all
> > other operations. It's like the demote never happened. Nothing is logged when
> > this happens; it just sits forever with half of the resources stopped and
> > DRBD master on the wrong node.
>
> For this at least I'd encourage a bug report with hb_report archive.
> Without the logs, the configuration alone won't tell us much.

Ah. I was looking for hb_report and assumed it had been removed from the packages, or wasn't in the packages I had. I'll do that next time; I can quantify exactly when it happens.

Regards,
Darren
Re: [Pacemaker] Lots of Issues with Live Pacemaker Cluster
On Mon, 2011-03-14 at 17:35 +0100, Dejan Muhamedagic wrote:

Hi,

On Mon, Mar 14, 2011 at 10:57:27AM -0000, darren.mans...@opengi.co.uk wrote:
> Hello everyone. I built and put into production without adequate testing a 2-node cluster running Ubuntu 10.04 LTS with Pacemaker and associated packages from the Ubuntu-HA-maintainers repo (https://launchpad.net/~ubuntu-ha-maintainers/+archive/ppa).

Not good to go live without sufficient testing. Testing is as important as anything else. Or even more important. If there isn't enough time for testing, then better to go without clustering.

I've very quickly realised this fact. Even if under pressure to put a cluster live, don't give in until you're 100% happy with it. It WILL bite you, and it won't be anyone else's fault but yours.

> 2. The crm shell won't load from a text file. When I use crm configure < crm.txt, it will run through the file, complaining about the default timeout being less than 240, but doesn't load anything. So I go into the crm shell and set default-action-timeout to 240, commit and exit, and do the same. This time it just exits silently, without loading the config.

Strange. I assume that you run version 1.0.x, which I don't use very often, but I cannot recall seeing this problem.

I'm not sure if I need to put a commit at the end of the input file? I always assumed it had an implicit commit. I'll test this next time I get the chance.

> If I go into the crm shell and use load replace crm.txt it will work.

Loading from a file was really meant to be done with configure load. Now, if there are errors/warnings in the configuration, what happens depends on the check-* options for semantic checks.

I'll try that, armed with this information, next time.

> 3. Crm shell tab completion doesn't work unless you put an incorrect entry in first. I'm sure this is a python readline problem, as it also happens in SLE 11 HAE SP1 (but not in pre-SP1). I assume everyone associated (Dejan?) is aware of the problem, but highlighting it just in case.

No, I'm not aware of it. Tab completion works here, though a bit differently from 1.0 due to lazy creation of the completion tables. You need to enter another level at least once before the tab completion is going to work for that level. For instance, it won't work in this case:

crm(live)# resource <TAB><TAB>

But it would once the user enters the resource level:

crm(live)resource# <TAB><TAB>
bye        failcount  move     restart  unmigrate
cd         help       param    show     unmove
cleanup    list       promote  start    up
demote     manage     quit     status   utilization
end        meta       refresh  stop
exit       migrate    reprobe  unmanage

Can you elaborate on "put incorrect entry first"?

I think this is more down to my lack of understanding of how it's changed, then. I'm used to 1.0 clusters, and the crm shell would always tab-complete *almost* everything (IIRC only location score rules etc. wouldn't). However, I think my confusion arose due to this behaviour:

crm(live)# resource mi<TAB><TAB>
(nothing)
crm(live)# resource mi<ENTER>
ERROR: syntax: mi
crm(live)# resource mi<TAB>
crm(live)# resource migrate

It will tab-complete the first and second level, if you've already entered an incorrect parameter.

Regards,
Darren Mansell
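P.S. For the archives, the file-loading forms Dejan refers to look like this (crm shell 1.0/1.1):

# replace the whole configuration with the file contents
crm configure load replace crm.txt
# or merge the file into the existing configuration
crm configure load update crm.txt
# when piping instead, a trailing 'commit' line in the file is the safe option:
crm configure < crm.txt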
Re: [Pacemaker] Lots of Issues with Live Pacemaker Cluster
I'm sorry if it came through to you that way. It's the challenges I face as an enterprise IT worker using Linux as my desktop. Either I use Outlook in a Windows VM and top-post like this, or use Evolution, quote correctly, but potentially cause encoding/unicode issues.

Regards,
Darren Mansell

-----Original Message-----
From: Digimer [mailto:li...@alteeve.com]
Sent: 15 March 2011 14:41
To: The Pacemaker cluster resource manager
Cc: Darren Mansell; and...@beekhof.net
Subject: Re: [Pacemaker] Lots of Issues with Live Pacemaker Cluster

On 03/15/2011 10:15 AM, darren.mans...@opengi.co.uk wrote:
> Gibberish

I'm not sure what that was supposed to be about, but it doesn't look like sane Chinese. Despite a few kana, it certainly wasn't Japanese. If it was spam...

--
Digimer
E-Mail: digi...@alteeve.com
AN!Whitepapers: http://alteeve.com
Node Assassin: http://nodeassassin.org
Re: [Pacemaker] [Linux-HA] Solved: SLES 11 HAE SP1 Signon to CIB Failed
> So I compared the /etc/ais/openais.conf in non-SP1 with /etc/corosync/corosync.conf from SP1 and found this bit missing, which could be quite useful...
>
> service {
>         # Load the Pacemaker Cluster Resource Manager
>         ver: 0
>         name: pacemaker
>         use_mgmtd: yes
>         use_logd: yes
> }
>
> Added it and it works. Doh. It seems the example corosync.conf that is shipped won't start pacemaker. I'm not sure if that's on purpose or not, but I found it a bit confusing after being used to it 'just working' previously.

Ah. Understandably confusing. That got fixed post-SP1, in a maintenance update that went out in September or thereabouts.

Regards,
Tim
--
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.

---

Thanks Tim. Although the media that can be downloaded *now* from Novell downloads still has this issue, so any new clusters will fall foul of it. Generally with a test build you won't perform updates, as it burns a licence you would need for the production system. Should the downloadable media have the issue fixed?

Regards,
Darren
[Pacemaker] Solved: [Linux-HA] SLES 11 HAE SP1 Signon to CIB Failed
On Fri, Jan 28, 2011 at 1:06 PM, darren.mans...@opengi.co.uk wrote:
> Hi all, this seems like it should be an easy one to fix, I'll raise a support call with Novell if required. Base install of SLES 11 32 bit SP1 with HAE SP1 and crm_mon gives 'signon to CIB failed'. Same thing with the CRM shell etc.

Too many open file descriptors? lsof might show something interesting

---

Unfortunately not. It seems that corosync doesn't spawn anything else, which is causing this issue.

From a SLES 11 HAE install:

root   7342  5.6  0.2 166048 38924 ?  SLl  2010  5685:08  aisexec
root   7349  0.0  0.0  67768 10516 ?  SLs  2010     3:02  \_ /usr/lib64/heartbeat/stonithd
90     7350  0.0  0.0  65028  4656 ?  S    2010     7:43  \_ /usr/lib64/heartbeat/cib
nobody 7351  0.0  0.0  61600  1832 ?  S    2010     8:24  \_ /usr/lib64/heartbeat/lrmd
90     7352  0.0  0.0  66284  2320 ?  S    2010     0:00  \_ /usr/lib64/heartbeat/attrd
90     7353  0.0  0.0  67536  3588 ?  S    2010     1:24  \_ /usr/lib64/heartbeat/pengine
90     7354  0.0  0.0  72392  3712 ?  S    2010     6:01  \_ /usr/lib64/heartbeat/crmd
root   7355  0.0  0.0  75148  2504 ?  S    2010     2:25  \_ /usr/lib64/heartbeat/mgmtd
root   4040  0.0  0.0      0     0 ?  Z    2010     0:00  \_ [aisexec] defunct
root   4059  0.0  0.0      0     0 ?  Z    2010     0:00  \_ [aisexec] defunct

From a SLES 11 SP1 HAE install:

root   9109  0.0  0.4  13308  2288 tty1 Ss+ Feb02  0:00  \_ -bash
root   8989  0.0  0.1   4344   744 tty2 Ss+ Feb02  0:00  /sbin/mingetty tty2
root   8990  0.0  0.1   4344   752 tty3 Ss+ Feb02  0:00  /sbin/mingetty tty3
root   8991  0.0  0.1   4344   748 tty4 Ss+ Feb02  0:00  /sbin/mingetty tty4
root   8992  0.0  0.1   4344   748 tty5 Ss+ Feb02  0:00  /sbin/mingetty tty5
root   8993  0.0  0.1   4344   744 tty6 Ss+ Feb02  0:00  /sbin/mingetty tty6
root  24883  0.0  0.8  89808  4424 ?    Ssl Feb02  0:34  /usr/sbin/corosync
lookup-01:~ #

So I compared the /etc/ais/openais.conf in non-SP1 with /etc/corosync/corosync.conf from SP1 and found this bit missing, which could be quite useful...

service {
        # Load the Pacemaker Cluster Resource Manager
        ver: 0
        name: pacemaker
        use_mgmtd: yes
        use_logd: yes
}

Added it and it works. Doh. It seems the example corosync.conf that is shipped won't start pacemaker. I'm not sure if that's on purpose or not, but I found it a bit confusing after being used to it 'just working' previously.
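A quick way to confirm the service stanza took effect, i.e. that corosync now spawns the Pacemaker children (a sketch; SLES init script names assumed):

rcopenais restart
ps axf | egrep 'corosync|heartbeat'
# expect stonithd/cib/lrmd/attrd/pengine/crmd (and mgmtd) under corosync
crm_mon -1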
[Pacemaker] SLES 11 HAE SP1 Signon to CIB Failed
Hi all, this seems like it should be an easy one to fix, I'll raise a support call with Novell if required. Base install of SLES 11 32 bit SP1 with HAE SP1 and crm_mon gives 'signon to CIB failed'. Same thing with the CRM shell etc. All the logs look fine and I'm root. It's using corosync / pacemaker. Any ideas? Has anyone seen this before?

Thanks
Darren Mansell
Re: [Pacemaker] [lvs-users] is it possible to have ldirector and real cluster server on same physical machine?
Check the /var/log/ldirectord.log file for errors, and check you can manually start it yourself:

rcldirectord restart

I've had to compile a Perl module myself for ldirectord in SLES 11 HAE: http://www.clusterlabs.org/wiki/Load_Balanced_MySQL_Replicated_Cluster#Missing_Perl_Socket6

You also need lvs_support=true on the IPaddr2 VIP resource you group with ldirectord. I've added this to the pacemaker list as it may be more suited for support there.

Darren Mansell

-----Original Message-----
From: lvs-users-boun...@linuxvirtualserver.org [mailto:lvs-users-boun...@linuxvirtualserver.org] On Behalf Of Mrvka Andreas
Sent: 06 December 2010 08:44
To: LinuxVirtualServer.org users mailing list.
Subject: Re: [lvs-users] is it possible to have ldirector and real cluster server on same physical machine?

Hello list,

sadly I didn't succeed last week in deploying the cluster. Please can anybody show me the error? It has to be somewhere very deep inside. I only want to have a two-node cluster with Apache load balanced and failing over. It shouldn't be that complex - but where is the error? Maybe the solution or these configs will help others.

Here my ldirectord.cf (with TABs):

autoreload = yes
checkinterval = 10
checktimeout = 3
logfile = /var/log/ldirectord.log
quiescent = yes
virtual = 10.10.11.60:80
	checktype = negotiate
	fallback = 127.0.0.1:80
	protocol = tcp
	real = 10.10.11.61:80 gate
	real = 10.10.11.62:80 gate
	receive = Still alive
	request = test.html
	scheduler = wlc
	service = http

My crm configure:

node linlbtemp01
node linlbtemp02
primitive ClusterIP ocf:heartbeat:IPaddr2 \
	operations $id=ClusterIP-operations \
	op monitor interval=5s timeout=20s \
	params ip=10.10.11.60 nic=lo cidr_netmask=16 lvs_support=true
primitive Virtual-IP-Apache ocf:heartbeat:IPaddr2 \
	params lvs_support=true ip=10.10.11.60 cidr_netmask=16 broadcast=255.255.255.255 \
	op monitor interval=1m timeout=10s \
	meta migration-threshold=10
primitive apache ocf:heartbeat:apache \
	op monitor interval=30s timeout=10s \
	meta migration-threshold=10 target-role=Started \
	params configfile=/etc/apache2/httpd.conf httpd=/usr/sbin/httpd testurl=/test.html
primitive ldirectord ocf:heartbeat:ldirectord \
	params configfile=/etc/ha.d/ldirectord.cf \
	op monitor interval=2m timeout=20s \
	meta migration-threshold=10 target-role=Started
group Load-Balancing Virtual-IP-Apache ldirectord
clone cl-apache apache
location Prefer-Node1 ldirectord \
	rule $id=prefer-node1-rule 100: #uname eq linlbtemp01
property $id=cib-bootstrap-options \
	dc-version=1.1.2-ecb1e2ea172ba2551f0bd763e557fccde68c849b \
	cluster-infrastructure=openais \
	expected-quorum-votes=2

My /etc/sysctl:

# Disable response to broadcasts.
# You don't want yourself becoming a Smurf amplifier.
net.ipv4.icmp_echo_ignore_broadcasts = 1
# enable route verification on all interfaces
net.ipv4.conf.all.rp_filter = 1
# enable ipV6 forwarding
#net.ipv6.conf.all.forwarding = 1
# increase the number of possible inotify(7) watches
fs.inotify.max_user_watches = 65536
# avoid deleting secondary IPs on deleting the primary IP
#net.ipv4.conf.default.promote_secondaries = 1
#net.ipv4.conf.all.promote_secondaries = 1
#net.ipv4.conf.lo.arp_ignore = 1
#net.ipv4.conf.lo.arp_announce = 2
#net.ipv4.conf.all.arp_ignore = 1
#net.ipv4.conf.all.arp_announce = 2
net.ipv4.conf.all.arp_ignore = 1
net.ipv4.conf.eth0.arp_ignore = 1
net.ipv4.conf.all.arp_announce = 2
net.ipv4.conf.eth0.arp_announce = 2
net.ipv4.ip_forward = 1

My ifcfg-lo:

IPADDR=127.0.0.1
NETMASK=255.0.0.0
NETWORK=127.0.0.0
BROADCAST=127.255.255.255
IPADDR_2=127.0.0.2/8
STARTMODE=onboot
USERCONTROL=no
FIREWALL=no
IPADDR_0=10.10.11.60 #VIP
NETMASK_0=255.255.255.255
NETWORK_0=10.10.11.0
BROADCAST_0=10.10.11.255
LABEL_0=0

Actually it seems that my ldirectord does not start out of openais. Can anybody point me to the error?

Thanks a lot in advance.
Andrew

-----Original Message-----
From: lvs-users-boun...@linuxvirtualserver.org [mailto:lvs-users-boun...@linuxvirtualserver.org] On Behalf Of darren.mans...@opengi.co.uk
Sent: Friday, 3 December 2010 14:53
To: lvs-us...@linuxvirtualserver.org
Subject: Re: [lvs-users] is it possible to have ldirector and real cluster server on same physical machine?

Glad it helped. This is my original howto for this kind of setup: http://www.clusterlabs.org/wiki/Load_Balanced_MySQL_Replicated_Cluster

Darren

-----Original Message-----
From: lvs-users-boun...@linuxvirtualserver.org [mailto:lvs-users-boun...@linuxvirtualserver.org] On Behalf Of Mrvka Andreas
Sent: 03 December 2010 13:46
To: 'LinuxVirtualServer.org users mailing list.'
Subject: Re: [lvs-users] is it possible to have ldirector and real cluster server on same physical machine?

Hi Darren, thank you for the detailed infos. I've read out of your messages that in
[Pacemaker] Help with understanding CIB scores
Hi all. Could anyone give me any pointers on how to easily find out what is stopping resources moving to a preferred node as expected? I'm looking at the ptest -Ls output and can see there is a greater score for a resource on another node than the node I am specifically locating. I can't see in the logs using grep lrmd.*$res /var/log/syslog anything that would indicate what's going wrong. I'm using Pacemaker 1.0.8+hg15494-2ubuntu2 on Ubuntu Lucid (10.04) with quite a large CIB. Many thanks.

Darren Mansell
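P.S. In case it helps anyone searching later, the score lines I'm reading come out of ptest in this form (resource name hypothetical):

ptest -Ls | grep -E 'native_color|group_color' | grep MyResource
# native_color: MyResource allocation score on node-a: 100
# native_color: MyResource allocation score on node-b: 150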
[Pacemaker] Time/Date Based Expressions in the CRM Shell
Apologies if this is in the documentation but I can't see how to use the time/date based expression resource constraints in the CRM shell. Can anyone provide an example config or point me to any documentation for how to use it? I'm trying to use these constraints to run scripts at certain times using the Anything RA, essentially using Pacemaker like an advanced cron. Does this sound like the right way to do it?

Cheers
Darren
Re: [Pacemaker] Time/Date Based Expressions in the CRM Shell
-----Original Message-----
From: Dejan Muhamedagic [mailto:deja...@fastmail.fm]
Sent: 31 March 2010 11:09
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Time/Date Based Expressions in the CRM Shell

Hi,

On Wed, Mar 31, 2010 at 10:56:29AM +0100, darren.mans...@opengi.co.uk wrote:
> Apologies if this is in the documentation but I can't see how to use the time/date based expression resource constraints in the CRM shell.

There's usage in the crm shell documentation and help for the location constraint (crm configure help location). There's nothing about the date format, but you should be able to find that in the Configuration Explained document. It's some ISO standard (looks a bit awkward).

Thanks, Dejan

I've added the resource and location constraint:

primitive QuoteCountGrab ocf:heartbeat:anything \
	params binfile=/usr/local/bin/qc-grab.sh errlogfile=/var/log/qc-grab.err logfile=/var/log/qc-grab.log pidfile=/var/run/qc-grab.pid user=root
location QuoteCountGrabSchedule QuoteCountGrab \
	rule inf: date date_spec hours=0-4

but it seems to have started the resource right away and ignored the date_spec. I've read the section on ensuring time-based rules take effect, and I've set cluster-recheck-interval to 5m, but the resource is still running. Any ideas?

Thanks
Darren
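P.S. Thinking about it some more: with only a positive rule, the location just adds score during the window; outside the window the resource is still allowed to run anywhere on a symmetric cluster, which would explain it starting right away. So the constraint presumably also needs the negative half, something like this (untested sketch, same date_spec syntax as above):

location QuoteCountGrabSchedule QuoteCountGrab \
	rule $id=qcg-on inf: date date_spec hours=0-4 \
	rule $id=qcg-off -inf: date date_spec hours=5-23

With cluster-recheck-interval set, the PE should then re-evaluate the rules and stop the resource within that interval of the window closing.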
Re: [Pacemaker] DRBD Recovery Policies
The odd thing is - it didn't. From my test, it failed back, re-promoted NodeA to be the DRBD master and failed all grouped resources back too. Everything was working, with the ~7GB of data I had put onto NodeB while NodeA was down now available on NodeA...

/proc/drbd on the slave said Secondary/Primary UpToDate/Inconsistent while it was syncing data back - so it was able to mount the inconsistent data on the primary node and access the files that hadn't yet sync'd over?! I mounted a 4GB ISO that shouldn't have been able to be there yet and was able to access data inside it.

Is my understanding of DRBD limited, and it's actually able to provide access to not-fully-synced files over the network link or something? If so - wow. I'm confused ;)

-----Original Message-----
From: Menno Luiten [mailto:mlui...@artifix.net]
Sent: 11 March 2010 19:35
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] DRBD Recovery Policies

Hi Darren,

I believe that this is handled by DRBD by fencing the Master/Slave resource during resync using Pacemaker. See http://www.drbd.org/users-guide/s-pacemaker-fencing.html. This would prevent Node A from promoting/starting services with outdated data (fence-peer), and it would be forced to wait with takeover until the resync is completed (after-resync-target).

Regards, Menno

On 11-3-2010 15:52, darren.mans...@opengi.co.uk wrote:
> I've been reading the DRBD Pacemaker guide on the DRBD.org site and I'm not sure I can find the answer to my question. Imagine a scenario:
> (NodeA, NodeB. Order and group: M/S DRBD promote/demote -> FS mount -> other resource that depends on the FS mount. DRBD master location score of 100 on NodeA.)
> NodeA is down, resources fail over to NodeB and everything happily runs for days. When NodeA is brought back online it isn't treated as split-brain, as a normal demote/promote would happen. But the data on NodeA would be very old and possibly take a long time to sync from NodeB. What would happen in this scenario? Would the RA defer the promote until the sync is completed? Would the inability to promote cause the failback to not happen, and a resource cleanup be required once the sync has completed? I guess this is really down to how advanced the Linbit DRBD RA is?
> Thanks
> Darren
Re: [Pacemaker] DRBD Recovery Policies
Fairly standard, but I don't really want it to be fenced, as I want to keep the data that has been updated on the single remaining NodeB while NodeA was being repaired:

global {
	dialog-refresh 1;
	minor-count 5;
}
common {
	syncer {
		rate 10M;
	}
}
resource cluster_disk {
	protocol C;
	disk {
		on-io-error pass_on;
	}
	syncer {
	}
	handlers {
		split-brain "/usr/lib/drbd/notify-split-brain.sh root";
	}
	net {
		after-sb-1pri discard-secondary;
	}
	startup {
		wait-after-sb;
	}
	on cluster1 {
		device /dev/drbd0;
		address 12.0.0.1:7789;
		meta-disk internal;
		disk /dev/sdb1;
	}
	on cluster2 {
		device /dev/drbd0;
		address 12.0.0.2:7789;
		meta-disk internal;
		disk /dev/sdb1;
	}
}

-----Original Message-----
From: Menno Luiten [mailto:mlui...@artifix.net]
Sent: 12 March 2010 10:05
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] DRBD Recovery Policies

Are you absolutely sure you set the resource-fencing parameters correctly in your drbd.conf (you can post your drbd.conf if unsure) and reloaded the configuration?

On 12-03-10 10:48, darren.mans...@opengi.co.uk wrote:
> The odd thing is - it didn't. From my test, it failed back, re-promoted NodeA to be the DRBD master and failed all grouped resources back too. [...]
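P.S. For reference, the resource-level fencing from the guide Menno linked would be wired into drbd.conf roughly like this (a sketch based on the DRBD 8.3 documentation; handler paths can differ by package):

resource cluster_disk {
	disk {
		fencing resource-only;
	}
	handlers {
		fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
		after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
	}
}

With this in place, DRBD asks Pacemaker to place a constraint against the Master role while its peer is outdated, so the stale node cannot be promoted (or mounted) until the resync completes; the after-resync-target handler then removes the constraint.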
[Pacemaker] DRBD Recovery Policies
I've been reading the DRBD Pacemaker guide on the DRBD.org site and I'm not sure I can find the answer to my question. Imagine a scenario:

- NodeA, NodeB
- Order and group: M/S DRBD promote/demote -> FS mount -> other resource that depends on the FS mount
- DRBD master location score of 100 on NodeA

NodeA is down, resources fail over to NodeB and everything happily runs for days. When NodeA is brought back online it isn't treated as split-brain, as a normal demote/promote would happen. But the data on NodeA would be very old and possibly take a long time to sync from NodeB.

What would happen in this scenario? Would the RA defer the promote until the sync is completed? Would the inability to promote cause the failback to not happen, and a resource cleanup be required once the sync has completed? I guess this is really down to how advanced the Linbit DRBD RA is?

Thanks
Darren
Re: [Pacemaker] DRBD and fencing
On Wed, Mar 10, 2010 at 02:32:05PM +0800, Martin Aspeli wrote:
> Florian Haas wrote:
> > On 03/09/2010 06:07 AM, Martin Aspeli wrote:
> > > Hi folks, let's say we have a two-node cluster with DRBD and OCFS2, with a database server that's supposed to be active on one node at a time, using the OCFS2 partition for its data store.
> > *cringe* Which database is this?
> Postgres. Why are you cringing? From my reading, I had gathered this was a pretty common setup to support failover of Postgres without the luxury of a SAN. Are you saying it's a bad idea?

PgSQL on top of DRBD is OK. PgSQL on top of OCFS2 is a disaster waiting to gnaw your leg off.

--

Please forgive my ignorance, I seem to have missed the specifics about using OCFS2 on DRBD dual-primary, but what are the main issues? How can you use PgSQL on dual-primary without OCFS2?

Thanks
Darren
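P.S. In case the answer is simply "don't": presumably the safer shape is single-primary DRBD with Postgres on a plain filesystem, so OCFS2 isn't needed at all. An untested sketch in crm shell, names hypothetical:

primitive p_drbd_pg ocf:linbit:drbd \
	params drbd_resource=pgdata \
	op monitor interval=15s
primitive p_fs_pg ocf:heartbeat:Filesystem \
	params device=/dev/drbd0 directory=/var/lib/pgsql fstype=ext3
primitive p_pgsql ocf:heartbeat:pgsql \
	op monitor interval=30s timeout=30s
ms ms_drbd_pg p_drbd_pg \
	meta master-max=1 clone-max=2 notify=true
group g_pg p_fs_pg p_pgsql
colocation pg-on-drbd inf: g_pg ms_drbd_pg:Master
order pg-after-drbd inf: ms_drbd_pg:promote g_pg:start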
Re: [Pacemaker] Help with OCFS2 / DLM Stability
Sorry, please ignore this mail. Client issues!

-----Original Message-----
From: darren.mans...@opengi.co.uk [mailto:darren.mans...@opengi.co.uk]
Sent: 10 March 2010 13:53
To: deja...@fastmail.fm
Cc: pacemaker@oss.clusterlabs.org
Subject: Re: Re: [Pacemaker] Help with OCFS2 / DLM Stability

On Wed, 2010-03-10 at 13:28 +0100, Dejan Muhamedagic wrote:

Hi,

On Tue, Mar 09, 2010 at 11:37:02AM -0000, darren.mans...@opengi.co.uk wrote:
> Hi everyone. Further to some discussions a couple of weeks ago with regard to OCFS2 on SLES 11 HAE, I'm looking to finally nail this problem. We have a 3 node cluster that has a STONITH shootout every week. This morning one node got stuck in a state where it couldn't be fenced due to the RSA not being responsive. I'm not sure if the problem is due to:
> * Network interruption causing Totem failures.
> * Java (Tomcat) processes falling over.

I suppose that those are activequote and activequoteadmin. You should increase the timeouts; 10 seconds is too short in general, and for java/tomcat probably even more so.

> * DLM falling over.
> * Any of the above in any combination.
> I've attached a hb_report. Could you see if you can see anything?

Any good reason to ignore quorum? For a three node cluster you should remove the no-quorum-policy property or, perhaps because of ocfs2, set it to freeze.

Pacemaker is 1.0.3, perhaps it's time to upgrade too. There is a SLE11 HAE update available.

From the logs:

Mar 9 06:28:43 OGG-ACTIVEQUOTE-02 pengine: [5540]: WARN: unpack_rsc_op: Processing failed op activequote:1_monitor_1 on OGG-ACTIVEQUOTE-03: unknown exec error

Interestingly, there is no lrmd log for this on 03.

Then there are several operation timeouts, perhaps due to ocfs2 hanging; two activequote and activequoteadmin stop operations could not be killed even with -9, so they were probably waiting for the disk.

Mar 9 06:29:40 OGG-ACTIVEQUOTE-02 openais[5439]: [crm ] info: pcmk_peer_update: lost: OGG-ACTIVEQUOTE-03 504997642

Do you know why the node vanished? You should try to keep your networking healthy.

Thanks,
Dejan

> Thanks
> Darren Mansell
Re: [Pacemaker] cluster/load balancing in openvz containers
That's about where I got to last time I looked at it. OpenVZ and Linux-HA should be great together; there are just lots of little configuration issues to get around. I went with the ldirectord / ipvsadm route but ran into issues with ARP config and didn't really have enough time to look into it. If you could create a config where you have a cluster with VZ instances controlled by RAs that do live migrations, it would be fantastic: a proper self-contained virtual machine cluster.

-----Original Message-----
From: wessel [mailto:wes...@techtribe.nl]
Sent: 25 February 2010 10:57
To: pacemaker@oss.clusterlabs.org
Subject: [Pacemaker] cluster/load balancing in openvz containers

Hi,

I am trying to get load balancing working on my test configuration, which consists of OpenVZ containers (I would even like to use OpenVZ on the production machines if possible; it makes configuration/migration etc. easy). I have the HA part working: I created venet interfaces on the containers and added them, together with the host interface, to a bridge on the host. This even works with 3 containers on 2 different hardware hosts.

The load balancing part is a bit more problematic. I created a clone of the IP and of the website, and it starts the web server on both containers, so up to there it looks fine. But if I do requests it always seems to come from the same node, until I put that node in standby; then I get the requests from another node.

My host is Ubuntu Hardy 8.04, my containers are Debian Lenny. Could the problem be because of the http://www.linux-ha.org/ClusterIP ipt_CLUSTERIP module, which is probably missing in my container kernel? I don't get any error messages in the logfile, at least none that look related. Below is my config. Thanks for any help/suggestions!

Wessel

node test2 \
	attributes standby=off
node test3 \
	attributes standby=off
node test4 \
	attributes standby=on
primitive Website ocf:heartbeat:apache \
	params configfile=/etc/apache2/apache2.conf \
	op monitor interval=10s
primitive failover-ip ocf:heartbeat:IPaddr \
	params ip=10.111.112.34 \
	op monitor interval=10s
clone WebIP failover-ip \
	meta globally-unique=true clone-max=2 clone-node-max=2
clone WebsiteClone Website
colocation website-with-ip inf: WebsiteClone WebIP
order apache-after-ip inf: WebIP WebsiteClone
property $id=cib-bootstrap-options \
	dc-version=1.0.7-54d7869bfe3691eb723b1d47810e5585d8246b58 \
	cluster-infrastructure=openais \
	expected-quorum-votes=3 \
	stonith-enabled=false \
	no-quorum-policy=ignore
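P.S. On the ipt_CLUSTERIP suspicion: OpenVZ containers share the hardware node's kernel, so the module has to exist there and be usable by iptables inside the container. A quick check, run on the host; the iptables line only illustrates the kind of rule a cloned IPaddr2 resource relies on (addresses and MAC illustrative):

modprobe ipt_CLUSTERIP
lsmod | grep -i clusterip
# what a 2-node CLUSTERIP rule looks like:
iptables -I INPUT -d 10.111.112.34 -i eth0 -j CLUSTERIP --new \
	--hashmode sourceip --clustermac 01:00:5E:00:00:20 \
	--total-nodes 2 --local-node 1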
Re: [Pacemaker] OCFS2 fencing regulated by Pacemaker?
Hello. Yes, we get the same kind of thing. SLES 11 HAE 64-bit. Average uptime of the boxes is about a week at the moment. Also 3 nodes using OCFS2 / cLVMD:

node OGG-NODE-01
node OGG-NODE-02 \
	attributes standby=off
node OGG-NODE-03
primitive STONITH-1 stonith:external/ibmrsa-telnet \
	params nodename=OGG-NODE-01 ip_address=192.168.1.12 password=PASSWORD username=USERID \
	op monitor interval=1h timeout=1m \
	op startup interval=0 timeout=1m \
	meta target-role=Started
primitive STONITH-2 stonith:external/ibmrsa-telnet \
	params nodename=OGG-NODE-02 ip_address=192.168.1.22 password=PASSWORD username=USERID \
	op monitor interval=1h timeout=1m \
	op startup interval=0 timeout=1m \
	meta target-role=Started
primitive STONITH-3 stonith:external/ibmrsa-telnet \
	params nodename=OGG-NODE-03 ip_address=192.168.1.32 password=PASSWORD username=USERID \
	op monitor interval=1h timeout=1m \
	meta target-role=Started
primitive Virtual-IP-App1 ocf:heartbeat:IPaddr2 \
	params lvs_support=true ip=192.168.1.100 cidr_netmask=24 broadcast=192.168.1.255 \
	op monitor interval=1m timeout=10s \
	meta migration-threshold=10
primitive Virtual-IP-App2 ocf:heartbeat:IPaddr2 \
	params lvs_support=true ip=192.168.1.103 cidr_netmask=24 broadcast=192.168.1.255 \
	op monitor interval=1m timeout=10s \
	meta migration-threshold=10
primitive ldirectord ocf:heartbeat:ldirectord \
	params configfile=/etc/ha.d/ldirectord.cf \
	op monitor interval=2m timeout=20s \
	meta migration-threshold=10 target-role=Started
primitive App1 lsb:App1 \
	op monitor interval=10s enabled=true timeout=10s \
	meta target-role=Started
primitive App2 lsb:App2 \
	op monitor interval=10s enabled=true timeout=10s \
	meta target-role=Started
primitive dlm ocf:pacemaker:controld \
	op monitor interval=120s
primitive o2cb ocf:ocfs2:o2cb \
	op monitor interval=2m
primitive fs ocf:heartbeat:Filesystem \
	params device=/dev/dm-0 directory=/opt/SAN/ fstype=ocfs2 \
	op monitor interval=120s
group Load-Balancing Virtual-IP-App1 Virtual-IP-App2 ldirectord
clone cl-App1 App1
clone cl-App2 App2
clone dlm-clone dlm \
	meta globally-unique=false interleave=true target-role=Started
clone o2cb-clone o2cb \
	meta globally-unique=false interleave=true target-role=Started
clone fs-clone fs \
	meta interleave=true ordered=true target-role=Started
location l-st-1 STONITH-1 -inf: OGG-NODE-01
location l-st-2 STONITH-2 -inf: OGG-NODE-02
location l-st-3 STONITH-3 -inf: OGG-NODE-03
location Prefer-Node1 ldirectord \
	rule $id=prefer-node1-rule 100: #uname eq OGG-NODE-01
colocation o2cb-with-dlm inf: o2cb-clone dlm-clone
colocation fs-with-o2cb inf: fs-clone o2cb-clone
order start-o2cb-after-dlm inf: dlm-clone o2cb-clone
order start-fs-after-o2cb inf: o2cb-clone fs-clone
order start-app1-after-fs inf: fs-clone cl-App1
order start-app2-after-fs inf: fs-clone cl-App2
property $id=cib-bootstrap-options \
	dc-version=1.0.3-0080ec086ae9c20ad5c4c3562000c0ad68374f0a \
	expected-quorum-votes=3 \
	no-quorum-policy=ignore \
	start-failure-is-fatal=false \
	stonith-action=reboot \
	last-lrm-refresh=1265882628 \
	stonith-enabled=true

We seem to have randomly picked up a standby=off node attribute. I can't see that it's causing any problems, but I'm too afraid to make any changes at the moment in case we have a(nother) shootout.

-----Original Message-----
From: Sander van Vugt [mailto:m...@sandervanvugt.nl]
Sent: 11 February 2010 08:30
To: pacema...@clusterlabs.org
Subject: [Pacemaker] OCFS2 fencing regulated by Pacemaker?

Hi,

I'm trying to set up OCFS2 in a Pacemaker environment (SLES 11 with HAE), in a 3-node cluster.
Now I successfully configured two volumes, the dlm and the o2cb resource. But if I shut down one of the nodes, at least one other node (and sometimes even two!) is fencing itself. I've been looking for a way to control this behavior, but can't find anything. Does anyone have a clue?

Thanks,
Sander
Re: [Pacemaker] OCFS2 fencing regulated by Pacemaker?
Once again, I apologise for the top-posting. I wish I could use a real mail client, but nothing apart from Outlook works properly with Exchange :(

Anyway - yes, we've had a really hard time with our 3-node SAN-based cluster. We implemented OCFS2 on top of a shared disk using o2cb and dlm clones. It seemed to work in the test environment, but live it's been a real nightmare. It seems if you even breathe on it it will start a shootout, but as it's now a production system I can't do much about it. Some mornings we arrive in and see that all 3 servers got STONITHd overnight, but we can't see any reason why. We would disable STONITH to see what state the cluster gets in before fencing, but the worst that happens is we get 10 mins of service unavailability, which is a lot better than 12 hours.

To complicate matters further, the apps we are using on the cluster / shared storage are Tomcat-based and allegedly don't work too well with other file locking mechanisms. This is developer hearsay though; I can't substantiate it.

The only leads I have are that the DLM seems to lose quorum and sets the fencing ops off. The logs never seem to tie up though, so it's very difficult to fault-find. With all this in mind, I haven't been able to file any bugs or make support requests to Novell, due to not knowing exactly what is causing the issue. At the moment, if we leave well alone it performs well. If I had to reboot a node, I would expect the others to get fenced afterwards.

Thanks for the help
Darren

-----Original Message-----
From: Dejan Muhamedagic [mailto:deja...@fastmail.fm]
Sent: 11 February 2010 14:12
To: pacemaker@oss.clusterlabs.org; m...@sandervanvugt.nl
Subject: Re: [Pacemaker] OCFS2 fencing regulated by Pacemaker?

Hi,

On Thu, Feb 11, 2010 at 01:16:20PM +0100, Sander van Vugt wrote:
> On Thu, 2010-02-11 at 13:03 +0100, Dejan Muhamedagic wrote:
> > Hi,
> > On Thu, Feb 11, 2010 at 10:11:33AM -0000, darren.mans...@opengi.co.uk wrote:
> > > Hello. Yes, we get the same kind of thing. SLES 11 HAE 64-bit.
> > Is there a bugzilla for this?
> Nope. Before filing a bug, I'd first like to be as sure as possible that it really is a bug and not a problem behind the keyboard. If you have strong doubts, closing a bugzilla is easy :)

BTW, this was meant for Darren actually, as it seemed like he was having a really hard time dealing with his cluster.

> BTW: I don't see where on bugzilla.novell.com I should enter a bug for something that is in the SLES HAE (and the Bugzilla FAQ didn't help me).

Use SUSE Linux Enterprise High Availability Extension for the product line.

Thanks, Dejan

> Thanks, Sander
Re: [Pacemaker] Announce: Hawk (HA Web Konsole) 0.2.0
On Tue, 2010-02-09 at 16:38 -0700, Tim Serong wrote:
> So, by fixed I clearly meant fixed in only one of the two places that require fixing. Please try the following change (the relevant file will be /srv/www/hawk/public/javascripts/application.js):

This now works great, thank you :)

(Sorry about the strange characters, can you get your colleagues to fix Evolution? ;) )

Regards,
Darren
Re: [Pacemaker] Announce: Hawk (HA Web Konsole) 0.2.0
On Tue, 2010-02-09 at 04:06 -0700, Tim Serong wrote:

On 2/9/2010 at 09:15 PM, darren.mans...@opengi.co.uk wrote:

Hi Tim. Thanks for this project, it seems to be exactly what we're looking for.

Well, I certainly hope so :)

I've installed it (it required spawn-fcgi too on SLES11 64) but I just get a blank page. I've looked at the page source and the divs have style="display: none". Not sure why that's happening, can you think of anything?

style="display: none" is used in two cases; one is for unexpanded children of a collapsible panel (but the header will still be visible). The other is if it thinks it can't see any useful information from cibadmin, in which case the expected behaviour would be an error message of some description. Can you please tell me:

- What version of Pacemaker you're running
- If you run "cibadmin -Ql | grep cluster-infrastructure", do you see any output? If so, what?

Thanks, Tim

It's pacemaker-1.0.3-4.1. No output for cluster-infrastructure. But the HTML source does contain information; display: none just hides it:

    <div id="summary" style="display: none">
      <table>
        <tr><th>Stack:</th><td><span id="summary::stack"></span></td></tr>
        <tr><th>Version:</th><td><span id="summary::version">1.0.3-0080ec086ae9</span></td></tr>
        <tr><th>Current DC:</th><td><span id="summary::dc">dm-ha1</span></td></tr>
        <tr><th>Stickiness:</th><td><span id="summary::default_resource_stickiness">0</span></td></tr>
        <tr><th>STONITH:</th><td><span id="summary::stonith_enabled">Enabled</span></td></tr>
        <tr><th>Cluster is:</th><td><span id="summary::symmetric_cluster">Symmetric</span></td></tr>
        <tr><th>No Quorum:</th><td><span id="summary::no_quorum_policy">stop</span></td></tr>
      </table>
    </div>

Thanks
Darren
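(The property Tim is checking for is the cluster-infrastructure entry in the CIB's cluster options; its absence matches the empty Stack field above. On a healthy openais-based cluster the grep would return something like the following - the nvpair id here is a placeholder:)

    # cibadmin -Ql | grep cluster-infrastructure
    <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="openais"/>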
Re: [Pacemaker] ocf:heartbeat:mysql RA: op monitor
On Tue, 2010-02-09 at 17:01 +0100, Oscar Remírez de Ganuza Satrústegui wrote:

Hello!

I have one question regarding the ocf:heartbeat:mysql RA.

I supposed that the following params defined on the resource were being used by the monitor operation to check the status of the mysql service: test_passwd="password" test_table="ldirectord.connectioncheck" test_user="servicecheck"

And that's what the crm is telling me:

    * test_user (string, [root]): MySQL test user
    * test_passwd (string): MySQL test user password
    * test_table (string, [mysql.user]): MySQL test table
      Table to be tested in monitor statement (in database.table notation)

But in my tests, they are not working as expected: they always tell me that the service is ok, even if I do not have the user servicecheck defined on the database, and also if I stop (kill -SIGSTOP) the mysql process.

How can they be used to check the status of the mysql service?

Thanks very much again!!

---
Oscar Remírez de Ganuza
Servicios Informáticos
Universidad de Navarra
Ed. de Derecho, Campus Universitario
31080 Pamplona (Navarra), Spain
tfno: +34 948 425600 Ext. 3130
http://www.unav.es/SI

Very odd. The RA does the following:

    buf=`echo "SELECT * FROM $OCF_RESKEY_test_table" | mysql --user=$OCF_RESKEY_test_user --password=$OCF_RESKEY_test_passwd --socket=$OCF_RESKEY_socket -O connect_timeout=1 2>&1`
    rc=$?
    if [ ! $rc -eq 0 ]; then
        ocf_log err "MySQL $test_table monitor failed:"
        if [ ! -z "$buf" ]; then
            ocf_log err "$buf"
        fi
        return $OCF_ERR_GENERIC
    else
        ocf_log info "MySQL monitor succeeded"
        return $OCF_SUCCESS
    fi

And if I kill the MySQL process on mine, the monitor detects this. What's in your logs?

Darren
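(For reference, these parameters are normally set on the primitive alongside a monitor op; a minimal sketch using the values from the post above - resource name and intervals are illustrative:)

    primitive mysql ocf:heartbeat:mysql \
            params test_user=servicecheck test_passwd=password \
                   test_table=ldirectord.connectioncheck \
            op monitor interval=30s timeout=30s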
Re: [Pacemaker] [PATCH] Allow the user to insert a startup configuration
Perhaps change:

    crm_info("Using initial configuration file : %s", static_config_file);

To:

    crm_warn("Using initial configuration file : %s", static_config_file);

? Anyone who would know to put a static config file in there in the first place would be proficient enough to look in the log file for clues about why their CIB keeps resetting?

-----Original Message-----
From: Dejan Muhamedagic [mailto:deja...@fastmail.fm]
Sent: 15 December 2009 10:48
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] [PATCH] Allow the user to insert a startup configuration

Hi,

On Tue, Dec 15, 2009 at 11:37:48AM +0100, Andrew Beekhof wrote:

Anyone else interested in this feature being added?

The configuration is not explicitly given to the cluster, but placed in a file. What happens on next startup? Who removes the file so that the cluster doesn't load it again? If the answer to the last question is "the admin", I'm against the feature.

Thanks, Dejan

On Dec 10, 2009, at 9:53 PM, frank.di...@bigbandnet.com wrote:

    # HG changeset patch
    # User Frank DiMeo <frank.di...@bigbandnet.com>
    # Date 1260478129 18000
    # Branch stable-1.0
    # Node ID e7067734add7f3b148cb534b85b5af256db9fad7
    # Parent  381160def02a34ae554637e0a26efda850ccc015
    initial load of static configuration file

    diff -r 381160def02a -r e7067734add7 cib/io.c
    --- a/cib/io.c  Thu Dec 10 09:07:45 2009 -0500
    +++ b/cib/io.c  Thu Dec 10 15:48:49 2009 -0500
    @@ -261,7 +261,7 @@
             crm_err("%s exists but does NOT contain valid XML. ", filename);
             crm_warn("Continuing but %s will NOT used.", filename);
    -    } else if(validate_cib_digest(root, sigfile) == FALSE) {
    +    } else if(sigfile && (validate_cib_digest(root, sigfile) == FALSE)) {
             crm_err("Checksum of %s failed! Configuration contents ignored!", filename);
             crm_err("Usually this is caused by manual changes, please refer to http://clusterlabs.org/wiki/FAQ#cib_changes_detected");
    @@ -282,11 +282,12 @@
     readCibXmlFile(const char *dir, const char *file, gboolean discard_status)
     {
         int seq = 0;
    -    char *filename = NULL, *sigfile = NULL;
    +    char *filename = NULL, *sigfile = NULL, *static_config_file = NULL;
         const char *name = NULL;
         const char *value = NULL;
         const char *validation = NULL;
         const char *use_valgrind = getenv("HA_VALGRIND_ENABLED");
    +    struct stat buf;

         xmlNode *root = NULL;
         xmlNode *status = NULL;
    @@ -300,7 +301,23 @@
         sigfile = crm_concat(filename, "sig", '.');

         cib_status = cib_ok;
    -    root = retrieveCib(filename, sigfile, TRUE);
    +
    +    /*
    +    ** We might drop a static config file in there as a known startup point.
    +    ** If we do, use it. It's called file.xml.static_init
    +    */
    +    static_config_file = crm_concat(filename, "static_init", '.');
    +
    +    crm_info("Looking for static initialization file : %s", static_config_file);
    +
    +    if(stat(static_config_file, &buf) == 0) {
    +        crm_info("Using initial configuration file : %s", static_config_file);
    +        root = retrieveCib(static_config_file, NULL, TRUE);
    +    }
    +    else {
    +        crm_info("Using found configuration file : %s", filename);
    +        root = retrieveCib(filename, sigfile, TRUE);
    +    }

         if(root == NULL) {
             crm_warn("Primary configuration corrupt or unusable, trying backup...");
    @@ -308,7 +325,6 @@
         }

         while(root == NULL) {
    -        struct stat buf;
             char *backup_file = NULL;

             crm_free(sigfile);
    @@ -409,6 +425,7 @@
             }
         }

    +    crm_free(static_config_file);
         crm_free(filename);
         crm_free(sigfile);
         return root;

--
Andrew
Re: [Pacemaker] I Like This But...
Depends on what you're doing with it as to whether it's challenging or not. The old Linux-HA had a very steep learning curve that isn't there as much anymore. A decent level of networking knowledge is required, but the documentation is now excellent and, with the CRM shell, Pacemaker is a lot easier to work with.

At the moment it is a great opportunity - the technology isn't widely known and people seem to get very scared when you start talking about what it does. It means you can deliver great solutions for very low cost, and you're seen as very capable for understanding it, or at least making it work.

But you shouldn't worry about how difficult configuring it is for other people. If you can read the documentation, understand it and configure it correctly, then all management should be interested in is the result. As you can get great results quickly and cheaply, they should have no cause to complain and every cause to give you a pay rise :)

It won't be long until Pacemaker gets more exposure and then suddenly every man and his dog is using and configuring it. Until then it's great to have a technology you can rely upon to give you great results and make you look good in the process.

Darren

-----Original Message-----
From: Fraser Doswell [mailto:doswe...@acanac.net]
Sent: 05 December 2009 03:56
To: pacemaker@oss.clusterlabs.org
Subject: [Pacemaker] I Like This But...

While Pacemaker is great, the process of configuring it is still challenging - the pieces seem to be everywhere. As a consultant, this is a great opportunity - pull the pieces together and make it all work. But any normal systems admin with a micro-addicted manager breathing down their neck will have a hard time using this software. I was in the trenches once and understand why they may not always read the docs. Someone is always interrupting - any time of the day!

Nonetheless, systems admins - please start at the beginning. Doing too much too fast creates a house of cards that is not understood. And read, read, read... THIS WILL TAKE TIME - THERE IS NO WAY AROUND IT.

The software is already better than the competition. Keep up the great work!

Fraser Doswell
Addington IR Inc.
Re: [Pacemaker] is ptest 1.06 working correctly?
I've never really understood the correct time to do the ptest graphs. I initiated a failover once and did the graph very quickly while it was in a transitional state, but I've always wondered if there is an easier way, i.e. "show me a graph of the migration plan if such-and-such were to happen".

-----Original Message-----
From: Frank DiMeo [mailto:frank.di...@bigbandnet.com]
Sent: 30 November 2009 16:56
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] is ptest 1.06 working correctly?

Actually, I don't know what you mean by the phrase "you don't have any transitions in live cib". Shouldn't ptest generate a graphical representation of the actions to be carried out on resources?

-Frank

-----Original Message-----
From: Rasto Levrinc [mailto:rasto.levr...@linbit.com]
Sent: Monday, November 30, 2009 11:38 AM
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] is ptest 1.06 working correctly?

On Mon, November 30, 2009 5:21 pm, Frank DiMeo wrote:

I actually did use "--" on the long options; for some reason the cut/paste in MS Outlook collapsed them. As you see from the enclosed files in my previous posting, the files are actually generated, there's just not much in them.

Oh, I see. It is because you don't have any transitions in the live cib. It works correctly as far as I can tell.

Rasto

--
: Dipl-Ing Rastislav Levrinc
: DRBD-MC http://www.drbd.org/mc/management-console/
: DRBD/HA support and consulting http://www.linbit.com/
DRBD(R) and LINBIT(R) are registered trademarks of LINBIT, Austria.
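(One way to get a graph without racing a live transition is to run ptest against a saved copy of the CIB. A sketch assuming ptest's -x/-D options and Graphviz are available; paths are placeholders:)

    cibadmin -Q > /tmp/cib.xml                       # snapshot the current CIB
    ptest -x /tmp/cib.xml -D /tmp/transition.dot     # compute the would-be transition
    dot -Tpng /tmp/transition.dot -o /tmp/transition.png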
Re: [Pacemaker] is ptest 1.06 working correctly?
This sounds very interesting. I look forward to trying it :)

(Sorry for the Outlook affliction.)

-----Original Message-----
From: Dejan Muhamedagic [mailto:deja...@fastmail.fm]
Sent: 30 November 2009 17:28
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] is ptest 1.06 working correctly?

Hi,

On Mon, Nov 30, 2009 at 05:04:35PM -, darren.mans...@opengi.co.uk wrote:

I've never really understood the correct time to do the ptest graphs. I initiated a failover once and did the graph very quickly while it was in a transitional state, but I've always wondered if there is an easier way, i.e. "show me a graph of the migration plan if such-and-such were to happen".

There's a fairly new feature in the crm shell with which it is possible to edit the status section, e.g. to simulate a resource failure or the node-lost event. Then you can try the ptest command (in configure) and it will show you what would happen. This feature was not complete at the time 1.0.6 was released and may still change. Also, if you change the configuration and run ptest _before_ commit, that will also display the graph of what would happen if the new configuration had been committed.

Thanks, Dejan
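(In practice, the "ptest before commit" workflow Dejan describes looks something like this in the crm shell - the Dummy resource is purely an illustrative change, and the graph pop-up assumes Graphviz/dotty is installed:)

    # crm configure
    crm(live)configure# primitive test-rsc ocf:heartbeat:Dummy
    crm(live)configure# ptest        # graphs the transition the new config would cause
    crm(live)configure# commit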
Re: [Pacemaker] no ais-keygen with ubuntu hardy and launchpad?
Try corosync-keygen, and work with /etc/corosync as if it were /etc/ais.

-----Original Message-----
From: Dirk Taggesell [mailto:dirk.tagges...@proximic.com]
Sent: 04 November 2009 11:39
To: pacemaker@oss.clusterlabs.org
Subject: [Pacemaker] no ais-keygen with ubuntu hardy and launchpad?

Hi all,

I am about to get a simple HA cluster up and running, and as the docs at clusterlabs recommend, I tried openais instead of heartbeat. Thus I incorporated

    deb http://ppa.launchpad.net/ubuntu-ha/ppa/ubuntu hardy main
    deb-src http://ppa.launchpad.net/ubuntu-ha/ppa/ubuntu hardy main

into /etc/apt/sources.list and installed openais, pacemaker and pacemaker-openais, along with what is pulled in as dependencies. But when I want to follow this: http://clusterlabs.org/wiki/Initial_Configuration there is neither ais-keygen nor is there /etc/ais. At least /etc/corosync exists, but neither openais nor corosync provide any sufficient documentation.

So, did I miss a package to be installed, or how does even the very basic configuration work?
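(A minimal sketch of the corosync equivalent of the ais-keygen step; "node2" is a placeholder hostname:)

    corosync-keygen        # gathers entropy, writes /etc/corosync/authkey
    scp /etc/corosync/authkey root@node2:/etc/corosync/authkey

The same key file must be present, root-readable only, on every node in the cluster.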
Re: [Pacemaker] why use ocf::linbit:drbd instead ofocf::heartbeat:drbd?
On 2009-10-10 10:37, xin.li...@cs2c.com.cn wrote:

Hi all:

As I understand it, drbd (8.3.2 and above) in Pacemaker has 2 OCF scripts: one is from linbit, the other is from heartbeat. In Andrew's "Clusters from Scratch - Fedora 11", "Configure the Cluster for DRBD", he uses ocf:linbit:drbd instead of ocf:heartbeat:drbd - why?

Because the heartbeat one is broken. It's that simple. Don't use it.

Can you say what parts are broken, though? We have just completed 2 large projects using the heartbeat RA for DRBD, as the Linbit version was not available at the start. SLES 11 HAE only ships DRBD 8.2.7, and using the later linbit OCF RA means compiling a later DRBD usertools and module from source, and then it's not supported by Novell. We haven't encountered any problems with the heartbeat RA yet, and we can't just change to the later version without lots of testing.

Thanks.
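(For comparison, the linbit RA is driven as a master/slave resource with notifications enabled. A minimal sketch with illustrative resource names; the DRBD resource "r0" is a placeholder:)

    primitive drbd0 ocf:linbit:drbd \
            params drbd_resource=r0 \
            op monitor interval=59s role=Master \
            op monitor interval=60s role=Slave
    ms ms-drbd0 drbd0 \
            meta clone-max=2 notify=true globally-unique=false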
Re: [Pacemaker] Low cost stonith device
I find the riloe plugin to be very good, so if you can get cheap HP servers with iLO then that could constitute a low-cost STONITH device.

-----Original Message-----
From: Mario Giammarco [mailto:mgiamma...@gmail.com]
Sent: 24 September 2009 19:08
To: pacema...@clusterlabs.org
Subject: [Pacemaker] Low cost stonith device

Hello,

Can you suggest me a list of stonith devices compatible with Pacemaker? I need a low-cost one.

I also have another idea to build a low-cost stonith device: I have intelligent switches. To stonith a node, I can send the switch a command to turn off all ethernet ports linked to the node to be fenced. So the node is powered on, but it cannot do any harm because it is disconnected from the network.

Is it a good idea? How can I implement it?

Thanks in advance for any help.

Mario
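(A sketch of what an iLO-based fencing resource might look like. The parameter names below are from memory of the cluster-glue riloe plugin - verify them with "stonith -t riloe -n" on your system - and all values are placeholders:)

    primitive st-node1 stonith:riloe \
            params hostlist=node1 ilo_hostname=node1-ilo.example.com \
                   ilo_user=Administrator ilo_password=secret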
Re: [Pacemaker] Arp and configuration advice
Greetings,

I have a two-webserver / two-database-server clustered setup. I've got ldirectord and LVS managed by Pacemaker and configured to be able to run on either database machine. I know how to disable ARP for the machine not running ldirectord; unfortunately, I'm not sure how to dynamically get the webservers to update their ARP cache when ldirectord gets moved upon failure. Is it possible to set up a service for the two web servers to delete the ARP cache entry for the VIP in the event that ldirectord gets moved?

The IPaddr2 RA runs send_arp when the start action is called, I believe:

    #
    #       Run send_arp to note peers about new mac address
    #
    run_send_arp() {
        ARGS="-i $ARP_INTERVAL_MS -r $ARP_REPEAT -p $SENDARPPIDFILE $NIC $BASEIP auto not_used not_used"
        if [ "x$IP_CIP" = "xyes" ] ; then
            if [ x = "x$IF_MAC" ] ; then
                MY_MAC=auto
            else
                MY_MAC=`echo ${IF_MAC} | sed -e 's/://'`
            fi
            ARGS="-i $ARP_INTERVAL_MS -r $ARP_REPEAT -p $SENDARPPIDFILE $NIC $BASEIP $MY_MAC not_used not_used"
        fi
        ocf_log info "$SENDARP $ARGS"
        case $ARP_BACKGROUND in
        yes)
            ($SENDARP $ARGS || ocf_log err "Could not send gratuitous arps") & ;;
        *)
            $SENDARP $ARGS || ocf_log err "Could not send gratuitous arps" ;;
        esac
    }

So when the VIP is started on another node, the other hosts should be notified that the IP has changed hosts. Doesn't that work for you?

I can build my own OCF script to update the ARP cache, that's not an issue. I simply don't know how to configure Pacemaker to say "Oops, db2 died. Move ldirectord to db1 and tell the webservers to update their ARP cache." Any suggestions?

Thanks in advance!
Justin

Regards,
Darren
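(If the gratuitous ARPs from send_arp are being filtered somewhere, a blunt manual workaround on each web server is simply to drop the stale neighbour entry for the VIP; the address below is a placeholder:)

    ip neigh flush to 192.168.0.100    # iproute2; or the older: arp -d 192.168.0.100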
[Pacemaker] Load-Balancing Confusion
Can anyone help me clear up my confusion with load-balancing / load-sharing using Linux-HA, please?

I've always used ldirectord/LVS with an IPaddr2 resource (not cloned), colocated them, and put the virtual IP address on the loopback of all nodes. When the IPaddr2 resource starts on any node, it removes the VIP from the loopback on that node. Traffic hits the node with the IPaddr2 and ldirectord resources, then gets redirected off to the other nodes on their lo devices, as they don't ARP.

This has worked fine so far, bar a few issues, but I'm not sure I'm doing it right. I'm using lvs_support=true in my CIB to allow it to work, but it doesn't do what it says: it doesn't set the IP on the loopback device of a node that isn't active; I have to do that myself.

Should I be cloning the IPaddr2 so it runs on both nodes? Would it need making into a multi-state resource for that to happen?

Sorry for all the questions; I've just opened a can of worms with this, all because I've found I can't run more than one service on 127.0.0.1:80, so I can't load-balance more than one web server.

Thanks.
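(For reference, the manual loopback/no-ARP setup described above is the standard LVS direct-routing realserver configuration; a sketch with a placeholder VIP:)

    # on each real server: stop the VIP being ARPed, then bind it to lo
    sysctl -w net.ipv4.conf.all.arp_ignore=1
    sysctl -w net.ipv4.conf.all.arp_announce=2
    ip addr add 192.168.0.100/32 dev lo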
Re: [Pacemaker] Resource Failover in 2 Node Cluster
I've now re-installed the SLES 11 HAE DRBD module and usertools and set the cluster to use the heartbeat RA, and it now fails over as expected. Does the Linbit-provided RA work differently? Is the following, from the logs, anything to do with it?

    Aug 19 11:08:37 gihub2 pengine: [4837]: notice: clone_print: Master/Slave Set: MS-DRBD-Disk
    Aug 19 11:08:37 gihub2 crmd: [4838]: info: unpack_graph: Unpacked transition 126: 29 actions in 29 synapses
    Aug 19 11:08:37 gihub2 pengine: [4837]: notice: print_list: Masters: [ gihub1 ]
    Aug 19 11:08:37 gihub2 crmd: [4838]: info: do_te_invoke: Processing graph 126 (ref=pe_calc-dc-1250676517-500) derived from /var/lib/pengine/pe-warn-1356.bz2
    Aug 19 11:08:37 gihub2 pengine: [4837]: notice: print_list: Slaves: [ gihub2 ]
    Aug 19 11:08:37 gihub2 crmd: [4838]: info: te_pseudo_action: Pseudo action 36 fired and confirmed
    Aug 19 11:08:37 gihub2 pengine: [4837]: notice: group_print: Resource Group: Resource-Group
    Aug 19 11:08:37 gihub2 crmd: [4838]: info: te_pseudo_action: Pseudo action 46 fired and confirmed
    Aug 19 11:08:37 gihub2 pengine: [4837]: notice: native_print: FileSystem (ocf::heartbeat:Filesystem): Started gihub1
    Aug 19 11:08:37 gihub2 crmd: [4838]: info: te_rsc_command: Initiating action 43: stop Virtual-IP_stop_0 on gihub1
    Aug 19 11:08:37 gihub2 pengine: [4837]: notice: native_print: ProFTPD (lsb:proftpd): Started gihub1
    Aug 19 11:08:37 gihub2 crmd: [4838]: info: te_rsc_command: Initiating action 62: notify DRBD-Disk:0_pre_notify_demote_0 on gihub2 (local)
    Aug 19 11:08:37 gihub2 pengine: [4837]: notice: native_print: Tomcat (lsb:tomcat): Started gihub1
    Aug 19 11:08:37 gihub2 crmd: [4838]: info: do_lrm_rsc_op: Performing key=62:126:0:76a53bb6-ce93-4f38-81b5-f3af04223710 op=DRBD-Disk:0_notify_0 )
    Aug 19 11:08:37 gihub2 pengine: [4837]: notice: native_print: Virtual-IP (ocf::heartbeat:IPaddr2): Started gihub1
    Aug 19 11:08:37 gihub2 crmd: [4838]: info: te_rsc_command: Initiating action 65: notify DRBD-Disk:1_pre_notify_demote_0 on gihub1
    Aug 19 11:08:37 gihub2 pengine: [4837]: WARN: native_color: Resource DRBD-Disk:1 cannot run anywhere

From: darren.mans...@opengi.co.uk [mailto:darren.mans...@opengi.co.uk]
Sent: 19 August 2009 10:20
To: pacemaker@oss.clusterlabs.org
Subject: [Pacemaker] Resource Failover in 2 Node Cluster

Hello everyone. I'm a little confused about how this setup should work.
This is my config:

    node gihub1
    node gihub2
    primitive stonith-SSH stonith:ssh \
            params hostlist="gihub1 gihub2"
    primitive DRBD-Disk ocf:linbit:drbd \
            params drbd_resource=gihub_disk \
            op monitor interval=59s role=Master timeout=30s \
            op monitor interval=60s role=Slave timeout=30s
    primitive FileSystem ocf:heartbeat:Filesystem \
            params fstype=ext3 directory=/www device=/dev/drbd0 \
            op monitor interval=30s timeout=15s \
            meta migration-threshold=10
    primitive ProFTPD lsb:proftpd \
            op monitor interval=20s timeout=10s \
            meta migration-threshold=10
    primitive Tomcat lsb:tomcat \
            op monitor interval=20s timeout=10s \
            meta migration-threshold=10
    primitive Virtual-IP ocf:heartbeat:IPaddr2 \
            params ip=2.21.4.45 broadcast=2.255.255.255 nic=eth0 cidr_netmask=8 \
            op monitor interval=30s timeout=15s \
            meta migration-threshold=10
    group Resource-Group FileSystem ProFTPD Tomcat Virtual-IP
    ms MS-DRBD-Disk DRBD-Disk \
            meta clone-max=2 notify=true globally-unique=false
    clone STONITH-clone stonith-SSH
    location DRBD-Master-Prefers-GIHub1 MS-DRBD-Disk \
            rule $id=drbd_loc_rule $role=master 100: #uname eq gihub1
    colocation Resource-Group-With-DRBD-Master inf: Resource-Group MS-DRBD-Disk:Master
    order Start-DRBD-Before-Filesystem inf: MS-DRBD-Disk:promote FileSystem:start
    property $id=cib-bootstrap-options \
            dc-version=1.0.3-0080ec086ae9c20ad5c4c3562000c0ad68374f0a \
            expected-quorum-votes=2 \
            no-quorum-policy=ignore \
            start-failure-is-fatal=false \
            stonith-action=poweroff \
            last-lrm-refresh=1250615730 \
            stonith-enabled=false

I had assumed (and I'm sure it worked like this before) that if I reboot gihub1, all the resources should instead start on gihub2. I have tried with stonith-enabled=true, which doesn't seem to change anything. Can anyone see from my config, or the attached messages log, what is going on? I've compiled DRBD 8.3.2 and I'm using the new Linbit DRBD RA. I'll try using the heartbeat RA in the meantime.

Many thanks
Darren Mansell
[Pacemaker] Temporarily Stop Cloned Resource on 1 Node
Hello all. I have a cloned resource that I need to stop temporarily on one node. Am I missing something quite obvious? I can't figure out how to do it without reconfiguring the CIB. Pacemaker 1.0.3 on SLES 11.

Thanks.
Darren
Re: [Pacemaker] setup for flexlm lmgrd failover
Hi,

I'm looking for a how-to on setting up Pacemaker as a failover pair of SUSE 10.2 FlexLM license manager servers, for both Portland Group and Intel compiler floating licenses - MAC address takeover, then start lmgrd, etc.

Many thanks
Jonathan

Hello. Can I just suggest you use SLES 11 HAE instead of SLES 10.2? The former works much better for HA.

Darren