Re: [Linux-HA] always have to cleanup LSB script on failover
[EMAIL PROTECTED] wrote:

Hello list, I have an ordered and colocated group that consists of the following elements, which start up in order:

    Resource Group: GROUPS_KNWORKS_mail
        drbddisk_knworks_mail (heartbeat:drbddisk): Started asknmapr01
        drbddisk_axigenbin (heartbeat:drbddisk): Started asknmapr01
        ip_knworks_mail (heartbeat::ocf:IPaddr2): Started asknmapr01
        fs_knworks_mail (heartbeat::ocf:Filesystem): Started asknmapr01
        fs_axigen_bin (heartbeat::ocf:Filesystem): Started asknmapr01
        ip_knworks_mail_external (heartbeat::ocf:IPaddr2): Started asknmapr01
        axigen_initscript (lsb:axigen): Started asknmapr01
        axigenfilters_initscript (lsb:axigenfilters): Started asknmapr01

Whenever I fail over/migrate the group between nodes, everything works just as expected, except that the two bottom LSB scripts never start. They just stay in stopped mode until I run the following commands:

    for x in asknmapr01 asknmapr02; do crm_resource -C -r axigenfilters_initscript -H $x; done
    for x in asknmapr01 asknmapr02; do crm_resource -C -r axigen_initscript -H $x; done

These two commands run a cleanup for both scripts on both nodes. After I run them, the scripts start fine without any further action from me. Upon migration, the same thing occurs again and I have to clean up the scripts once more to get them to run. Any ideas would be very much appreciated. Thanks!

Did you check your script is LSB compliant? A howto is here:
http://wiki.linux-ha.org/LSBResourceAgent

Regards
Dominik
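A quick way to run the compliance checks from that howto against an init script, sketched here for the axigen script (the path is an example; the exit codes in the comments are what the LSB spec and the cluster expect):

    /etc/init.d/axigen start;  echo "start: $?"    # expect 0
    /etc/init.d/axigen status; echo "status: $?"   # expect 0 while running
    /etc/init.d/axigen start;  echo "start: $?"    # a second start must still return 0
    /etc/init.d/axigen stop;   echo "stop: $?"     # expect 0
    /etc/init.d/axigen status; echo "status: $?"   # expect 3 when stopped
    /etc/init.d/axigen stop;   echo "stop: $?"     # a second stop must still return 0

If status does not return 3 for a stopped resource, the probe the cluster runs after a migration can report a failure, which would match the "stays stopped until cleanup" symptom described above.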
Re: [Linux-HA] Resource reached -INFINITY
Due to a temporary initialization problem, a resource reached a -INFINITY score on one node. Is there a way to instruct heartbeat to recalculate the score of the resource on the node without restarting heartbeat?

Clean up the resource:

    crm_resource -C -r $res -H $node

And maybe you have to reset the failcount, but that depends on what happened and on your configuration:

    crm_failcount -D -r $res -U $node

Regards
Dominik
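Before deleting the failcount, it can be worth checking whether it is actually what pushed the score down. A minimal sketch, reusing the resource and node variables from above:

    crm_failcount -G -r $res -U $node   # query the current failcount first
    crm_failcount -D -r $res -U $node   # then delete it if it is non-zero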
Re: [Linux-HA] Almost done with my HA setup, but something not working
Nick Duda wrote:

(sorry for the long email, but all my configs are here to view)

I posted before about HA with 2 squid servers. It's just about done, but I'm stumbling on something. Every time I manually cause something to happen in hopes of seeing it fail over, it doesn't. For example, I get crm_mon to show everything as I want it, and when I kill squid (and prevent the xml from restarting it) it just goes into a failed state... more below. Anyone see anything wrong with my configs?

    Server #1
    Hostname: ha-1
    eth0 - lan (192.168.95.1)
    eth1 - xover to eth1 on other server

    Server #2
    Hostname: ha-2
    eth0 - lan (192.168.95.2)
    eth1 - xover to eth1 on other server

ha.cf on each server:

    bcast eth1
    mcast eth0 239.0.0.2 694 1 0
    node ha-1 ha-2
    crm on

Not using haresources because of crm. Here is the output from crm_mon:

    Last updated: Mon Apr 21 15:44:53 2008
    Current DC: ha-1 (2422b230-22f2-451b-aa95-0b783eccab8d)
    2 Nodes configured.
    1 Resources configured.
    Node: ha-1 (2422b230-22f2-451b-aa95-0b783eccab8d): online
    Node: ha-2 (1691d699-2a81-4545-8242-b00862431514): online
    Resource Group: squid-cluster
        ip0 (heartbeat::ocf:IPaddr2): Started ha-1
        squid (heartbeat::ocf:squid): Started ha-1

If squid stops on the current heartbeat server, ha-1, it will restart within 60 sec... so the scripting is working. If I stop the squid process and rename /etc/init.d/squid to something else, the script won't be able to execute the squid start and should fail over to ha-2, but it doesn't. Instead this appears (on both ha-1 and ha-2):

What exactly do you rename and how? It's likely the cluster is behaving sanely and you're just creating a test case you don't understand.

Regards
Dominik

    Last updated: Mon Apr 21 15:47:49 2008
    Current DC: ha-1 (2422b230-22f2-451b-aa95-0b783eccab8d)
    2 Nodes configured.
    1 Resources configured.
    Node: ha-1 (2422b230-22f2-451b-aa95-0b783eccab8d): online
    Node: ha-2 (1691d699-2a81-4545-8242-b00862431514): online
    Resource Group: squid-cluster
        ip0 (heartbeat::ocf:IPaddr2): Started ha-1
        squid (heartbeat::ocf:squid): Started ha-1 (unmanaged) FAILED
    Failed actions:
        squid_stop_0 (node=ha-1, call=74, rc=1): Error
Re: [Linux-HA] Almost done with my HA setup, but something not working
Nick Duda wrote:

I rename the restart script for squid.

Your OCF script or your /etc/init.d script?

My current setup (based on examples on the web) shows that if squid fails on the currently running server, it will try to restart itself. If the restart fails, it will fail over. So basically I am trying to make a test case scenario: if the squid startup script in /etc/init.d got deleted and squid crashed, it should fail over to the other box. It's not.

Ah, your /etc/init.d script. Okay, look at your OCF script and what it does when /etc/init.d/squid is not there:

    ---
    INIT_SCRIPT=/etc/init.d/squid
    case $1 in
    start)
        ${INIT_SCRIPT} start >/dev/null 2>&1 && exit 0 || exit 1
        ;;
    stop)
        ${INIT_SCRIPT} stop >/dev/null 2>&1 && exit 0 || exit 1
        ;;
    status)
        ${INIT_SCRIPT} status >/dev/null 2>&1 && exit 0 || exit 1
        ;;
    monitor)
        # Check if resource is stopped
        ${INIT_SCRIPT} status >/dev/null 2>&1 || exit 7
        # Otherwise check services (XXX: Maybe loosen retry / timeout)
        wget -o /dev/null -O /dev/null -T 1 -t 1 http://localhost:3128/ && exit 0 || exit 1
        ;;
    meta-data)
    ---

So for the next monitor operation, it will exec

    ${INIT_SCRIPT} status >/dev/null 2>&1 || exit 7

This will probably return 7, so the cluster thinks your resource is stopped. As it was running before (I guess?), the cluster will now try to stop and start it. Stop calls

    ${INIT_SCRIPT} stop >/dev/null 2>&1 && exit 0 || exit 1

This will return 1, so the stop operation failed. With stonith, your node would be rebooted now. I don't see a stonith device, so the resource goes unmanaged. I think what you see is intended.

Regards
Dominik
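The failing stop is the crux: the cluster will only fail a resource over once a stop has succeeded. A minimal sketch of a more forgiving stop action (my illustration, not part of Nick's agent; the early "exit 0" branches are the point):

    stop)
        # OCF semantics: stop on an already-stopped resource must succeed.
        if [ ! -x "${INIT_SCRIPT}" ]; then
            # init script missing/renamed: if squid is not actually running,
            # reporting success lets the cluster proceed with failover
            pgrep squid >/dev/null 2>&1 || exit 0
            exit 1
        fi
        ${INIT_SCRIPT} status >/dev/null 2>&1 || exit 0   # already stopped
        ${INIT_SCRIPT} stop >/dev/null 2>&1 && exit 0 || exit 1
        ;;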
Re: [Linux-HA] Constraint: Two drbd masters on the same node?
I have the following problem with a two-node cluster: I have two DRBD resources. On the node where drbd0 is master, a certain resource group with different resources will be activated. On the node where drbd1 is master, this will happen with another resource group.

You can get the necessary constraints from the DRBD Howto:
http://wiki.linux-ha.org/DRBD/HowTov2

Now I want the two DRBD resources to always be master on the same node. How do I create such a constraint in the cib/crm file?

    <rsc_colocation id="master_on_master" to="ms-drbd3" to_role="master" from="ms-drbd1" from_role="master" score="INFINITY"/>

Regards
Dominik
Re: [Linux-HA] New questions relating to: Methods of dealing with network fail(ure/over)
Stallmann, Andreas wrote:

Hi there!

I have set up a two-node heartbeat cluster running apache and drbd. Everything went fine until we tested a split brain scenario. In this case, when we detach both network cables from one host, we get a two-primary situation. I read in the thread "methods of dealing with network failover" that setting up stonith and a quorum node might be a good workaround. Well... it isn't in our situation, I think. Let's assume we have the following scenario:

- The two nodes, having two interfaces each, monitor each other via unicast queries over both interfaces.
- We do not have any dedicated cross-over or serial connections, because the servers reside in buildings a few kilometers apart.
- We have only the two Linux nodes in our network which are part of our cluster (well, a few more to be honest, but those are the two we may fiddle around with).
- We won't be able to set up a (dedicated) quorum server.
- We do not have a network-enabled power socket we might deactivate for the node which we want to shoot in the head.

Now someone stumbles over the network cables of, let's say, node-b, detaching it from the network. node-b and node-a do not receive any unicast replies from their peer anymore, but node-a can still ping its ping host, while node-b can't. node-b should now assume that it's very likely "dead". node-a can't be sure, because it can't reach its peer but can still reach the rest of the network (or at least its ping node). Actually, I'd like to see the following happen:

- If a node is secondary and assumes that it's very likely dead, it should not be allowed to take over any resources.
- If a node is primary and isn't sure about its peer, it should freeze its state at least till its peer is reachable over one interface.

That's about exactly what the dopd (drbd peer outdater daemon) is for. Look into
http://blogs.linbit.com/florian/2007/10/01/an-underrated-cluster-admins-companion-dopd/

Dopd was rather unusable for the last few weeks (/months?), but I read it recently received a bunch of fixes and is supposed to work now (according to the drbd-user mailing list and Lars Ellenberg, one of the main authors of drbd). Refer to that mailing list and the blog entry. I'd be glad if you told us how this worked out.

Regards
Dominik
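For reference, the usual wiring for dopd, sketched from the blog entry above (binary paths and the timeout value are examples and may differ per distribution):

In ha.cf:

    respawn hacluster /usr/lib/heartbeat/dopd
    apiauth dopd gid=haclient uid=hacluster

In drbd.conf, per resource:

    resource r0 {
        disk {
            fencing resource-only;
        }
        handlers {
            fence-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
        }
    }

With this in place, a primary that loses the replication link asks dopd (via heartbeat's remaining communication paths) to mark the peer's disk as outdated, so a cut-off secondary refuses to become primary.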
Re: [Linux-HA] Three questions on failcount attribute
    [EMAIL PROTECTED] sbin # ./crm_failcount -G -U isdl601 -r caebench.proc
    name=fail-count-caebench.proc value=(null)
    Error performing operation: The object/attribute does not exist

Is this intentional?

At least the normal behaviour.

... in that version. Ah right, crm_failcount gives a reasonable answer now - I just did "cibadmin -Q | grep fail" and never found anything with 0 failures :)

From a consistency point of view, creating it with a value of 0 would make sense. RFE?

b) It would be nice to have crm_mon display the failcounts for all resources on all nodes, or mark nodes with any failcount > 0 as "failed" or "online/failed". RFE?

Yep, would be nice. Request that as an enhancement in the bugzilla.

... which is what happens in newer versions :) I agree one can see a failed operation. But the failcount in crm_mon?
Re: [Linux-HA] how to configure the cluster as a simple two node (master-slave) cluster?
Hi

To run all resources on the same node, you could put them in a group. Read
http://wiki.linux-ha.org/ClusterInformationBase/ResourceGroups

If you want to decide which node the group is usually located on, you need a rsc_location constraint. An example is also on that page (and a sketch follows below). To move the group to another node after a resource failure, you have to look into resource_failure_stickiness. Read http://wiki.linux-ha.org/ScoreCalculation.

Regards
Dominik

[EMAIL PROTECTED] wrote:

hi, i want to construct a two-node cluster.
1. firstly, all of the resources are started on the DC node.
2. when an error occurs on some critical resource, all of the resources are moved to the other node.
3. all resources must run on only one node for their whole lifetime, until an error occurs and they are moved to the other node.

i know some attributes (like resource stickiness) can locate the resources in one place. any more advice? thank you
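A minimal sketch of the two pieces mentioned above, a group plus a location preference, with hypothetical resource and node names:

    <group id="mygroup">
      <primitive id="ip_1" class="ocf" provider="heartbeat" type="IPaddr2">
        <instance_attributes id="ip_1_ia">
          <attributes>
            <nvpair id="ip_1_addr" name="ip" value="192.168.0.10"/>
          </attributes>
        </instance_attributes>
      </primitive>
      <!-- further primitives; group members start in order and stay together -->
    </group>

    <rsc_location id="prefer_node1" rsc="mygroup">
      <rule id="prefer_node1_rule" score="100">
        <expression id="prefer_node1_expr" attribute="#uname" operation="eq" value="node1"/>
      </rule>
    </rsc_location>

To make the group leave a node after failures, combine this with a negative resource_failure_stickiness, so accumulated failures eventually outweigh the 100-point preference.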
Re: [Linux-HA] Ordering question
William Francis wrote:

http://linux-ha.org/DRBD/HowTov2 has the example

    <rsc_order id="drbd0_before_fs0" from="fs0" action="start" to="ms-drbd0" to_action="promote"/>

This reads "start fs0 after ms-drbd0 promote", which seems to mean promote ms-drbd0 (the "to") THEN start fs0 (the "from"). But if you look at this page http://www.linux-ha.org/ClusterInformationBase/ResourceGroups it has, for example

    <rsc_order id="database_before_apache" from="WebServerDatabase" action="start" type="before" to="WebServerApache" symmetrical="TRUE"/>

Notice type="before", which is not the default. So this reads "start WebServerDatabase before WebServerApache". You can set type to "after" or "before"; "after" is the default.

Regards
Dominik
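So, if I read those semantics right, the following two constraints should express the same ordering (a sketch to illustrate, reusing the names from the first example):

    <rsc_order id="o1" from="fs0" action="start" to="ms-drbd0" to_action="promote"/>
    <!-- implicit type="after": fs0 start happens after ms-drbd0 promote -->

    <rsc_order id="o2" from="ms-drbd0" action="promote" type="before" to="fs0" to_action="start"/>
    <!-- explicit type="before": ms-drbd0 promote happens before fs0 start -->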
Re: [Linux-HA] Three questions on failcount attribute
Martin Knoblauch wrote:

Hi, three questions on the failcount attribute. I am running 2.0.8, and yes, I know I should upgrade... :-(

Good to know you know :)

a) Is it possible that the failcount for a resource/node is only available after a failure? On a not-yet-failed resource I see:

    [EMAIL PROTECTED] sbin # ./crm_failcount -G -U isdl601 -r caebench.proc
    name=fail-count-caebench.proc value=(null)
    Error performing operation: The object/attribute does not exist

Is this intentional?

At least the normal behaviour.

From a consistency point of view, creating it with a value of 0 would make sense. RFE?

b) It would be nice to have crm_mon display the failcounts for all resources on all nodes, or mark nodes with any failcount > 0 as "failed" or "online/failed". RFE?

Yep, would be nice. Request that as an enhancement in the bugzilla.

c) Something I will likely be shot for :-) If I set the failcount of a resource/node to e.g. -5, does this mean that the resource can fail 6 times before the node is no longer eligible to run the resource?

No. Something like this can be achieved with resource_failure_stickiness. Read http://wiki.linux-ha.org/ScoreCalculation

Regards
Dominik
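A sketch of what that looks like (values are hypothetical): give the resource a negative failure stickiness as a meta attribute, so every failure subtracts from its score on the current node until another node wins.

    <primitive id="caebench.proc" class="..." provider="..." type="...">
      <meta_attributes id="caebench-meta">
        <attributes>
          <nvpair id="caebench-fs" name="resource_failure_stickiness" value="-100"/>
        </attributes>
      </meta_attributes>
    </primitive>

For example, with a combined preference plus stickiness of 300 on the current node and 0 on the other, a value of -100 would let the resource fail about three times before the other node wins.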
Re: [Linux-HA] crm_failcount queries quite slow?
Lars Marowsky-Bree wrote:
On 2008-04-03T13:59:36, Dejan Muhamedagic [EMAIL PROTECTED] wrote:

Any crm* program is significantly slower on a non-DC node regardless of whether something's happening in the cluster. It's always been like that.

I can confirm that. It's been like that for me ever since I started using heartbeat.

Hm, I've not personally observed that in my test cluster, or at least not noticed anything out of line. "Significantly slower" is bad; we mandate that DC or not DC is _not_ the question, and that users shouldn't care about this designation. Could anyone who reproduces this report a few more details? Is it the local node, the time it takes to process on the DC, or the network roundtrip? (Should be observable using tcpdump/wireshark)

Just 2 measurements:

    dktest2sles10:~# time crmadmin -D
    Designated Controller is: dktest2sles10
    real    0m0.005s
    user    0m0.004s
    sys     0m0.000s

    dktest1sles10:~/cib# time crmadmin -D
    Designated Controller is: dktest2sles10
    real    0m1.014s
    user    0m0.000s
    sys     0m0.004s

    dktest2sles10:~# time cibadmin -Q >/dev/null
    real    0m0.009s
    user    0m0.004s
    sys     0m0.004s

    dktest1sles10:~/cib# time cibadmin -Q >/dev/null
    real    0m1.713s
    user    0m0.004s
    sys     0m0.004s

tcpdump: y.x.z.103 is the DC, y.x.z.102 is the other node

    08:22:16.803702 IP 10.200.200.102.32952 > 10.200.200.103.694: UDP, length 217
    08:22:16.803626 IP 10.250.250.102.32951 > 10.250.250.103.694: UDP, length 221
    08:22:16.803637 IP 10.250.250.102.32951 > 10.250.250.103.694: UDP, length 217
    08:22:16.929482 IP 10.250.250.103.32869 > 10.250.250.102.694: UDP, length 221
    08:22:16.929528 IP 10.200.200.103.32870 > 10.200.200.102.694: UDP, length 221

Up to here, it's been just the normal heartbeat packets, I think. Notice the roughly identical length. Then I do:

    dktest1sles10:~/cib# date +%H:%M:%S:%N; time cibadmin -Q >/dev/null
    08:22:16:04482
    real    0m1.189s
    user    0m0.008s
    sys     0m0.00

    08:22:16.929976 IP 10.250.250.103.32869 > 10.250.250.102.694: UDP, length 2263
    08:22:16.930026 IP 10.200.200.103.32870 > 10.200.200.102.694: UDP, length 2263
    08:22:16.930029 IP 10.200.200.103 > 10.200.200.102: udp
    08:22:16.929979 IP 10.250.250.103 > 10.250.250.102: udp

Both servers received an ntpdate sync against the same time source a minute earlier. So to me, it looks like it's the DC who needs some time to process the request. The cluster had one primitive resource at that time and should have been pretty much idle.

Regards
Dominik
Re: [Linux-HA] lsb resource problem
Hi

William Francis wrote:

Ubuntu 7.10 with DRBD 8.0.3 and Heartbeat 2.1.2 with an updated Filesystem file, kernel 2.6.22-14 (updated from stock). I have possibly two problems, a heartbeat and a DRBD issue. My goal is to get a pair of machines working with a large /opt partition for zimbra (my mail server software) and a virtual IP.

1. I can configure heartbeat and DRBD with a virtual IP with no problems at all. I can start and stop heartbeat on the two machines, and because of the colocations I have set up, the resources move around properly with no problems. If I start zimbra manually on the machine that currently has the /opt partition mounted and the virtual IP, it works with no problem (I installed it with no issues). I then add Zimbra, an lsb resource, like so:

    <primitive id="zimbra" class="lsb" type="zimbra"/>

You did check that this script is LSB compliant? If not, see http://wiki.linux-ha.org/LSBResourceAgent and change the script if necessary.

In crm_mon, I can see it start the zimbra resource (on the machine with the other resources). However, after several seconds it reports a failure and I see something like this in crm_mon:

    Master/Slave Set: ms-drbd0
        drbd0:0 (heartbeat::ocf:drbd): Master d243
        drbd0:1 (heartbeat::ocf:drbd): Started d242
    fs0 (heartbeat::ocf:Filesystem): Started d243
    ip_resource (heartbeat::ocf:IPaddr): Started d243
    zimbra (lsb:zimbra): Started d243 (unmanaged) FAILED
    Failed actions:
        zimbra_start_0 (node=d243, call=7, rc=1): Error
        zimbra_stop_0 (node=d243, call=8, rc=1): Error

Well, as you can see, the start operation failed. Therefore, the resource is stopped afterwards (notice the larger call number). But the stop operation also failed. So as the cluster cannot say what status this resource is in, it will not be touched anymore. Actually, if you had stonith configured, your node would be rebooted now, but that's another topic.

It should be noted that zimbra takes a long time to start and stop, maybe as long as two minutes

Then you should set an appropriate value for "timeout" in the start operation. Something like this:

    <primitive ...>
      <operations>
        <op id="zimbra-start-op" name="start" timeout="120s"/>
      </operations>
    </primitive>

since it launches many sub-processes. If there is a way to take that into account, I don't know where to do it. Also, I have made rsc_order and rsc_colocation constraints, but I have the same results as here. If I start zimbra by its init.d script and then 'echo $?' it returns 0 and starts properly.

That's because the default timeout is 20s (or something in that range at least) and that seems to not be enough for you.

What I don't get is that it looks like it's trying to start zimbra before DRBD is active, even though I have a rsc_order set not to do so. The constraints are below and I've included a small part of the logs at the bottom. It seems to fail because it can't write out to a file on /opt, which it can't do because it's not mounted.

2. Let's say I restart heartbeat on the other machine. DRBD does not seem to reconnect properly and I get stuck with them in WFReportParams/WFBitMapT, and I have yet to find a way outside of rebooting one machine to fix this. This only happens when I have zimbra as a resource; when nothing is really using /opt, I can switch back and forth with no problems. I've seen some reports that this might be a DRBD/kernel version problem, but it seems like most of those were under DRBD 7.
For master/slave resources, you should use a newer version of heartbeat and especially of the crm (which is now called pacemaker and needs to be installed separately). To install a newer version, please read http://www.clusterlabs.org/mw/Install and use this pacemaker version: http://hg.clusterlabs.org/pacemaker/stable-0.6/archive/tip.tar.gz

I have removed all files in the rc*.d directories for drbd and zimbra. Much of this was taken directly from faqs and howtos. I will happily provide logs or other debugging info. Configs to follow.

    [EMAIL PROTECTED]:/root/tmp# cat /proc/drbd
    version: 8.0.3 (api:86/proto:86)
    SVN Revision: 2881 build by [EMAIL PROTECTED], 2008-03-25 00:46:06
    0: cs:WFBitMapT st:Secondary/Primary ds:UpToDate/UpToDate C r---
       ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
       resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
       act_log: used:0/257 hits:0 misses:0 starving:0 dirty:0 changed:0

    [EMAIL PROTECTED]:/etc/init.d# cat /proc/drbd
    version: 8.0.3 (api:86/proto:86)
    SVN Revision: 2881 build by [EMAIL PROTECTED], 2008-03-24 16:02:09
    0: cs:WFReportParams st:Primary/Unknown ds:UpToDate/DUnknown C r---
       ns:4 nr:42960 dw:43504 dr:45105 al:0 bm:7 lo:2 pe:0 ua:0 ap:1
       resync: used:0/31 hits:51 misses:7 starving:0 dirty:0 changed:7
       act_log: used:1/257 hits:136 misses:1 starving:0 dirty:0 changed:0

drbd.conf:

    global { usage-count yes; }
    common { syncer { rate 50M; } }
    resource drbd0 {
        protocol C;
Re: [Linux-HA] Re: pingd problem
Achim Stumpf wrote:

Hi, now it works. I have changed in cib.xml:

    <rsc_location id="group_1:connected" rsc="group_1">
      <rule id="group_1:connected:rule" score_attribute="pingd">
        <expression id="group_1:connected:expr:defined" attribute="pingd" operation="defined"/>
      </rule>
    </rsc_location>

to

    <rsc_location id="group_1:connected" rsc="group_1">
      <rule id="group_1:connected:rule" score="-INFINITY" boolean_op="or">
        <expression id="group_1:connected:expr:undefined" attribute="pingd" operation="not_defined"/>
        <expression id="group_1:connected:expr:zero" attribute="pingd" operation="lte" value="0"/>
      </rule>
    </rsc_location>

Now it works as expected. Those two setups were described on http://www.linux-ha.org/pingd. But still it would be nice to get ping nodes working with scores as in my first example, as described on that page under "Quickstart - Run my resource on the node with the best connectivity". Does anyone have any hints how to get that stuff working?

That should work - what did you see (and what did you expect)? You probably need to adjust the pingd multiplier. showscores.sh helps a lot here to figure out what's not as expected (see http://www.linux-ha.org/ScoreCalculation ).

Regards
Dominik
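For reference, the multiplier is set where pingd is started. A sketch of the ha.cf variant (path and values are examples; see the pingd page above for the authoritative options):

    ping 10.14.0.10
    respawn hacluster /usr/lib/heartbeat/pingd -m 500 -d 5s -a pingd

With -m 500, each reachable ping node contributes 500 to the pingd node attribute, which the score_attribute="pingd" rule then adds to the group's score on that node.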
Re: [Linux-HA] Re: pingd problem
    </primitive>
    <primitive class="ocf" id="apache2_1" provider="heartbeat" type="apache">
      <operations>
        <op id="apache2_1_mon" interval="60s" name="monitor" timeout="55s"/>
      </operations>
      <instance_attributes id="apache_2_1_inst_attr">
        <attributes>
          <nvpair id="apache_2_1_attr_0" name="configfile" value="/etc/httpd/conf/httpd.conf"/>
          <nvpair id="apache_2_1_attr_1" name="statusurl" value="http://127.0.0.1/server-status/"/>
          <nvpair id="apache_2_1_attr_2" name="options" value="-DSSL"/>
        </attributes>
      </instance_attributes>
    </primitive>
    <instance_attributes id="group_1_instance_attrs">
      <attributes>
        <nvpair id="group_1_target_role" name="target_role" value="started"/>
        <nvpair id="group_1_resource_stickiness" name="resource_stickiness" value="200"/>
      </attributes>
    </instance_attributes>

Apart from the fact that these attributes should be meta_attributes instead of instance_attributes, this will give you a score of 4 * 200 = 800 for the node the group is actually running on. So with ping working, you should have scores of

    800 + 500 for node1
    500 for node2

Now you block icmp on node1. You will have:

    800 on node1
    500 on node2

So why should the cluster move any resource?

Regards
Dominik

    </group>
    </resources>
    <constraints>
      <rsc_location id="rsc_location_group_1" rsc="group_1">
        <rule id="prefered_location_group_1" score="100">
          <expression attribute="#uname" id="prefered_location_group_1_expr" operation="eq" value="sputnik.test"/>
        </rule>
      </rsc_location>
      <rsc_location id="group_1:connected" rsc="group_1">
        <rule id="group_1:connected:rule" score_attribute="pingd">
          <expression id="group_1:connected:expr:defined" attribute="pingd" operation="defined"/>
        </rule>
      </rsc_location>
    </constraints>
    </configuration>
    <status/>
    </cib>

It would be nice to actually get this setup working with scores, but it does not work, as you see in the logs. There won't be a failover to the other node. Any hints how I could get the setup with scores as above working?

Thanks, Achim
Re: [Linux-HA] Re: pingd problem
Achim Stumpf wrote:

This will give you a pingd score of 500. A ping_group is treated as one ping_host score-wise. If you want to take each ping host's connectivity into play, you should have

    ping 10.14.0.10
    ping 10.14.0.11
    ping 10.14.0.12
    ping 10.14.0.13

instead. This would give a pingd score of 2000 (and make your setup work score-wise).

I know that a ping_group is treated as one ping_host score-wise. I expect the score to be 500 for that.

Apart from the fact that these attributes should be meta_attributes instead of instance_attributes, this will give you a score of 4 * 200 = 800 for the node the group is actually running on. [...] So why should the cluster move any resource?

Ah, ok. So it's better to change this to:

    <meta_attributes id="group_1_instance_attrs">
      <attributes>
        <nvpair id="group_1_target_role" name="target_role" value="started"/>
        <nvpair id="group_1_resource_stickiness" name="resource_stickiness" value="200"/>
      </attributes>
    </meta_attributes>

And the score of 200 counts for every primitive in the group. Ok, so it's 800. I thought it counted only one time.

Again, read http://wiki.linux-ha.org/ScoreCalculation

Apart from my misunderstanding here with the 4*200 score, does my setup work score-wise now? Or am I missing anything?

I don't know the config you use now, but if you address the score issue I pointed out, I guess it should. The cib looked ok.

Regards
Dominik
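To spell out the difference in ha.cf terms (a sketch; the group name is arbitrary, and the numbers assume a pingd multiplier of 500 as in this thread):

    # one attribute contribution for the whole group: score 500 total
    ping_group mygroup 10.14.0.10 10.14.0.11 10.14.0.12 10.14.0.13

    # one contribution per host: score up to 4 x 500 = 2000
    ping 10.14.0.10
    ping 10.14.0.11
    ping 10.14.0.12
    ping 10.14.0.13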
Re: [Linux-HA] cibadmin question
Jason Erickson wrote:

I am trying to add a resource with this command:

    cibadmin -C -o resources -x meatware_stonithcloneset.xml

It is telling me "could not parse input file". Here is the xml file as well:

    <clone id="meat_stonith_cloneset">
    −

This "−" is not actually in there, is it? If it is, get rid of it. Other than that - it works for me. Which version are you using?

Regards
Dominik

      <instance_attributes id="meat_stonith_cloneset">
    −
        <attributes>
          <nvpair id="meat_stonith_cloneset-01" name="clone_max" value="2"/>
          <nvpair id="meat_stonith_cloneset-02" name="clone_node_max" value="1"/>
          <nvpair id="meat_stonith_cloneset-03" name="globally_unique" value="false"/>
          <nvpair id="meat_stonith_cloneset-04" name="target_role" value="started"/>
        </attributes>
      </instance_attributes>
    −
      <primitive id="meat_stonith_clone" class="stonith" type="meatware" provider="heartbeat">
    −
        <operations>
          <op name="monitor" interval="5s" timeout="20s" prereq="nothing" id="meat_stonith_clone-op-01"/>
          <op name="start" timeout="20s" prereq="nothing" id="meat_stonith_clone-op-02"/>
        </operations>
    −
        <instance_attributes id="meat_stonith_clone">
    −
          <attributes>
            <nvpair id="meat_stonith_clone-01" name="hostlist" value="lin lor"/>
          </attributes>
        </instance_attributes>
      </primitive>
    </clone>

Jason Erickson
Re: [Linux-HA] crm_failcount queries quite slow?
Abraham Iglesias wrote:

Hi all, I'm trying to implement my snmp mib module to get every resource's failcount in the cluster. I'm surprised that the crm_failcount query to get the failcount for a resource takes 2-3 seconds. With 8 resources in the cluster, that adds up to 16-24s, which is quite poor performance. Is it normal that the failcount query takes so long?

Run it on the DC. Should be way faster there.

Regards
Dominik
Re: [Linux-HA] crm_failcount queries quite slow?
Abraham Iglesias wrote:

Thank you for the answer, Dominik. As you said, it is faster on the DC, but I need to run these queries on every node, as every node can be asked for that information. :S

crm_failcount -U ;)

Is there any cached data about these values? Or a static file where the results are stored?

You could also grep /var/lib/heartbeat/crm/cib.xml

Regards
Dominik
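The -U hint above means you can ask about any node's failcount from wherever you run the command. To avoid the non-DC slowness entirely, one option is to route the query through the DC. A sketch (the awk parsing assumes the "Designated Controller is: <name>" output shown elsewhere in this digest; resource and node names are placeholders):

    DC=$(crmadmin -D | awk '{print $NF}')
    ssh "$DC" crm_failcount -G -r myresource -U mynode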
Re: [Linux-HA] crm_failcount queries quite slow?
Abraham Iglesias wrote:

Hi, thanks for the failcount tip ;) By the way, I'm using heartbeat 2.0.8. There is no information in cib.xml about failcounts... is that possible? Or am I missing anything?

No. I was wrong. The failcount is in the status section, which is never written to disk. Sorry about that.

Regards
Dominik
Re: [Linux-HA] HA maintenance mode
I don't see an option to specify the behavior for standby mode in the manual. I just want to prevent HA from moving resources to other nodes for maintenance purposes.

So basically, you want to stop all resources at once, don't you? Here's a feature request:
http://developerbugs.linux-foundation.org/show_bug.cgi?id=1862

Regards
Dominik
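Until something like that feature exists, the closest approximation I know of is to make resources unmanaged, so the cluster neither stops nor moves them while you work on the node. A sketch (this assumes the is_managed_default crm_config option is available in your version):

    <cluster_property_set id="cib-bootstrap-options">
      <attributes>
        <nvpair id="opt-unmanaged" name="is_managed_default" value="false"/>
      </attributes>
    </cluster_property_set>

Set it back to "true" after maintenance. Monitors keep reporting failures while resources are unmanaged; the cluster just doesn't act on them.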
Re: [Linux-HA] adding DRBD group problem
B) put everything in a group in the master_slave resource? ... never tried this by myself

I don't think this would work without changing all of the group's resource agents to be master/slave resource agents. Every resource within the master_slave would be promoted/demoted etc., and that is probably not possible out of the box.

Regards
Dominik
Re: [Linux-HA] Re: Re: showscores.sh weirdness and Not failing over after repeated kills of IPaddr2?
Roland G. McIntosh wrote:

Dominik Klein wrote:
With a failure stickiness of -30, you allow your group's resources to fail (400/30)=14 times. Is that what you want?

Although the default failure stickiness is -30, the group has a failure stickiness of -100. I would like to fail over after 3 or 4 failures. My test with 15 stop commands was just to be sure.

Well, with your new cib, that should work as intended. The cib of your first email would not, and what else should I refer to :)

You don't have any monitor operations for the ipaddr and jboss resources. Failures on them are not detected. Configure monitor operations and try again.

I actually do have monitor operations on both; I accidentally sent out an old cib.xml, updated file attached. The previous showscores.sh output is correct for this cib.xml. It behaves as described in the last e-mail even with monitor operations on IPaddr2 and jboss.

Also make sure you use a recent version. Otherwise you may also hit the bug of not increasing the failcount in 2.1.3's crm. This is fixed in pacemaker (0.6.x).

Uh oh. I definitely have 2.1.3 with the crm_failcount bug, but I didn't think this would affect score calculation.

If the failcount is not updated, the score is not changed when a resource fails again. Read http://wiki.linux-ha.org/ScoreCalculation

I didn't install a pacemaker package, I used the CentOS4 extras RPMs. I hope CentOS4 / RHEL4 packages can be released. I could not rebuild RHEL5 packages from the openSUSE ha-clustering repository due to this:

    configure:3065: gcc -c -O2 -g conftest.c >&5
    conftest.c:2: error: syntax error before "me"
    configure:3071: $? = 1
    configure: failed program was:
    | #ifndef __cplusplus
    |   choke me
    | #endif

Should I seek an alternative to these CentOS 4 extras RPMs?

Can't give any advice on that. Any RH/CentOS users?

pps. where did you get the jboss RA? I'd be interested in it.

http://rgm.nu/jbossocf - I hacked it together, it's ugly. Hope it's useful to you, although it's tailored to a very old jboss release 3.0.8, with some customizations to support multiple instances of jboss on different ports. Relies on ps, awk, egrep, and curl; only tested on RHEL4. You'll want to change the HTTPCODE check with an URL for your servlet (or modify it to use jmx-console).

Thanks. I will have a look at it and see if I can make it work for me :)

Regards
Dominik
Re: [Linux-HA] slave's drbd resource doesn't get promoted when master dies
I think I have found out my problem though: I didn't put in the resource location stuff for pingd. I added this snippet to the CIB to constrain the master/slave drbd resource to not run on a node with lost connectivity, and so far in my tests it seems to work:

    <rsc_location id="drbd_id:connected" rsc="ms-drbd_id">
      <rule id="drbd_id:connected:rule" score="-INFINITY" boolean_op="or">
        <expression id="drbd_id:connected:expr:undefined" attribute="pingd" operation="not_defined"/>
        <expression id="drbd_id:connected:expr:zero" attribute="pingd" operation="lte" value="0"/>
      </rule>
    </rsc_location>

Slightly OT, but with this config: consider that if all your ping nodes are down, your resource will not run anywhere. If that's okay with you, stay with that config. Otherwise you might want to set something like this:

    <rsc_location id="res-connected" rsc="res">
      <rule role="master" id="res-connected-rule" score_attribute="pingd">
        <expression id="res-connected-rule-1" attribute="pingd" operation="defined"/>
      </rule>
    </rsc_location>

This will *add* the pingd attribute's value to the score for the master role. If pingd loses connectivity, this will be expressed in a lower score (as the pingd attribute value is 0 then). If the other node has a higher score, the resource will be migrated. You will have to play around with the pingd multiplier according to your number of ping nodes to get those values right. You can use my showscores script to see the scores and test:
http://wiki.linux-ha.org/ScoreCalculation#head-4355c45fc51c60c8e0f8a063bdc4069fdc17f761

... and when I put the master in standby, the resources are correctly migrated. The same goes when I power off the master or yank the eth0 network cable. I still have issues with failover, as it seems that 'auto_failback off' is not honored correctly.

That is an R1 configuration option. It does nothing in R2 (crm) mode. Read http://wiki.linux-ha.org/ScoreCalculation and set an appropriate value for resource stickiness. This way you can achieve behaviour like "auto_failback off" in R1.

Regards
Dominik
Re: [Linux-HA] Virtual IP
Jason Erickson wrote:
The only part that I am confused about is: where do you set the resource score for a node?

Within the constraints section of the cib. Something like this:

    <constraints>
      <rsc_location id="rscloc-webserver" rsc="webserver">
        <rule id="rscloc-webserver-rule-1" score="200">
          <expression id="rscloc-webserver-expr-1" attribute="#uname" operation="eq" value="node1"/>
        </rule>
      </rsc_location>
    </constraints>

Regards
Dominik
Re: [Linux-HA] showscores.sh weirdness and Not failing over after repeated kills of IPaddr2?
Hi

Roland G. McIntosh wrote:

No matter how many times I kill IPaddr2, I can't seem to cause a failover in my simple 2 node cluster.

OT, but why do people keep calling 2 node clusters "simple" clusters? Clusters are not simple. Maybe it's a rather "small" cluster.

I'm trying to get it working for the 3 services in my group (HB 2.1.3 on RHEL4 using CentOS packages). I don't understand why showscores.sh shows INFINITY for my OCF resources, but an integer value for the IPaddr2 resource.

This is expected. In a colocated group, only the first resource receives the configured integer stickiness value (times the number of resources in that group). Read below.

Here is the output of my showscores.sh:

    [EMAIL PROTECTED] rss]$ ./showscores.sh
    Resource        Score      Node         Stickiness  #Fail  Fail-Stickiness
    slink_db        -INFINITY  slinkfail    100         0      -30
    slink_db        INFINITY   slinkmaster  100         0      -30
    slink_ipaddr2   0          slinkfail    100         0      -30
    slink_ipaddr2   400        slinkmaster  100         0      -30

As you see here, you have a node preference of 100 plus 3 * 100 stickiness.

    slink_jboss     -INFINITY  slinkfail    100         0      -30
    slink_jboss     INFINITY   slinkmaster  100         0      -30

The INFINITY is implicitly given by the colocated group - that way your resources run on the same node. -INFINITY is there to make sure they don't run on any other node than the one the first resource was started on. With a failure stickiness of -30, you allow your group's resources to fail (400/30)=14 times. Is that what you want?

I'm using the Mar 2008 version of showscores.sh (thanks Dominik!), so perhaps this is related to the known issue of meta attributes on the group instead of on the primitive.

From your config - no, it's not about that, as you don't have a stickiness meta attribute for the group, just default values.

I've been trying to force a failover like this:

    export OCF_RESKEY_ip=192.168.1.222
    for nn in `seq 1 15`; do
        /usr/lib/ocf/resource.d/heartbeat/IPaddr2 stop
        sleep 1m
    done

After one stop, the score becomes 200. Then it seems to jump back up to 300 and stays there. It never proceeds down below zero as I expect. I have a colocation constraint, as you can see in my cib.xml.

You don't have any monitor operations for the ipaddr and jboss resources. Failures on them are not detected. Configure monitor operations and try again. Also make sure you use a recent version. Otherwise you may also hit the bug of not increasing the failcount in 2.1.3's crm. This is fixed in pacemaker (0.6.x).

This line from your config:

    <rsc_colocation id="colocation_MyGroup" from="MyGroup" to="MyGroup" score="INFINITY"/>

is not needed. I don't even know what you want to express with this.

Regards
Dominik

ps. I'll add the group score things to http://www.linux-ha.org/ScoreCalculation soon.
pps. where did you get the jboss RA? I'd be interested in it.
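To make the arithmetic above explicit, using the values from this thread:

    score(active node) = location preference + group_size x stickiness + failcount x failure stickiness
                       = 100 + 3 x 100 + f x (-30)

So the first resource starts at 400 and loses 30 per detected failure; it would take about 14 failures to drop below the 0 the standby node has. A failure stickiness of -100 would bring that down to roughly the 4 failures Roland wants.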
Re: [Linux-HA] DRBD problems
Néstor wrote:

I am getting these errors when running the command "drbdadm adjust mysql" on WAHOO:

    Failure: (114) Lower device is already claimed. This usually means it is mounted.

Well, is it?

    Command 'drbdsetup /dev/drbd0 disk /dev/sda2 /dev/sda2 internal --set-defaults --create-device --on-io-error=detach' terminated with exit code 10

And on my second node WAHOO2, I get:

    No response from the DRBD driver! Is the module loaded?
    Command 'drbdsetup /dev/drbd0 disk /dev/sda2 /dev/sda2 internal --set-defaults --create-device --on-io-error=detach' terminated with exit code 20

Saw this a couple of times, too. Which drbd version are you using?

Regards
Dominik
Re: [Linux-HA] DRBD problems
Néstor wrote:

Version 8.2.5. I think it is telling me that the device is already mounted.

Right. Is it?

What I do not understand then is how to pick a directory or device. Do I need to re-partition my device to create a separate device for drbd, or can I pick just a directory within the device partition that I want to use?

You can use an existing partition or create a new one if you have space left to do so. Drbd cannot be used on files or directories.

Regards
Dominik
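For illustration, a minimal resource definition on a dedicated, unmounted partition (the IPs and the /dev/sda3 partition are placeholders; the filesystem is created and mounted on /dev/drbd0 afterwards, never on the backing partition directly):

    resource mysql {
        protocol C;
        on WAHOO {
            device    /dev/drbd0;
            disk      /dev/sda3;       # an unmounted partition reserved for drbd
            address   10.0.0.1:7788;
            meta-disk internal;
        }
        on WAHOO2 {
            device    /dev/drbd0;
            disk      /dev/sda3;
            address   10.0.0.2:7788;
            meta-disk internal;
        }
    }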
Re: [Linux-HA] Help with failure-stickiness
Roland G. McIntosh schrieb:

I've got 3 resources in a group, and I'd like to configure stickiness values such that if there are more than 3 failures in the group, all resources go to the failover node. I've read http://www.linux-ha.org/v2/faq/forced_failover many times, but do not quite understand from that how to configure my cluster stickiness values.

Did you read http://www.linux-ha.org/ScoreCalculation ?

Regards
Dominik
Re: [Linux-HA] slave's drbd resource doesn't get promoted when master dies
Jean-Francois Malouin wrote:

I thought I had it nailed, but still no go. I'm running a simple two-node Active/Passive Debian/Etch cluster with apache, mysql, heartbeat-2.1.3 and drbd-8.2.5, using mcast on the primary NIC and bcast on secondary GigE interfaces, which is also the replication link for drbd. I also set up a serial link between the nodes. I've set up dopd as per the drbd user guide and Florian's blog, and it seems to work as documented when I 'ifconfig down eth1' or do other nasty things with the x-over cable. I can migrate the drbd master manually back and forth between the nodes (feeble-0 and feeble-1), but if I crm_standby the master or shutdown/pull the plug on the primary, then the secondary doesn't get promoted, the drbd gets split-brained and I must then manually untangle the mess. My cib is obviously not correct, but my brain is having a hard time parsing the xml... any pointers please?

The xml looks good to me.

Logs show after attempting a crm_standby:

    pengine[5003]: 2008/03/19_16:55:58 info: unpack_nodes: Node feeble-1 is in standby-mode
    pengine[5003]: 2008/03/19_16:55:58 info: determine_online_status: Node feeble-1 is standby
    pengine[5003]: 2008/03/19_16:55:58 info: determine_online_status: Node feeble-0 is online
    pengine[5003]: 2008/03/19_16:55:58 WARN: unpack_rsc_op: Processing failed op drbd_id:0_promote_0 on feeble-0: Error

Find out why this failed.

    pengine[5003]: 2008/03/19_16:55:58 notice: clone_print: Master/Slave Set: ms-drbd_id
    pengine[5003]: 2008/03/19_16:55:58 notice: native_print: drbd_id:0 (heartbeat::ocf:drbd): Master feeble-0 FAILED
    pengine[5003]: 2008/03/19_16:55:58 notice: native_print: drbd_id:1 (heartbeat::ocf:drbd): Stopped
    pengine[5003]: 2008/03/19_16:55:58 notice: native_print: fs_id (heartbeat::ocf:Filesystem): Stopped
    pengine[5003]: 2008/03/19_16:55:58 notice: native_print: ip_id (heartbeat::ocf:IPaddr): Stopped
    pengine[5003]: 2008/03/19_16:55:58 notice: native_print: mysql_id (heartbeat::ocf:mysql): Stopped
    pengine[5003]: 2008/03/19_16:55:58 notice: native_print: apache_id (heartbeat::ocf:apache): Stopped
    pengine[5003]: 2008/03/19_16:55:58 notice: native_print: email_id (heartbeat::ocf:MailTo): Stopped
    pengine[5003]: 2008/03/19_16:55:58 WARN: native_color: Resource drbd_id:1 cannot run anywhere

2 node cluster, one node in standby, failed promote on the other node - that means the resource cannot run anywhere.

cib.xml resources and constraints sections:

    <resources>
      <master_slave id="ms-drbd_id">
        <meta_attributes id="ma-ms-drbd1_id">
          <attributes>
            <nvpair id="ma-ms-drbd-1_id" name="clone_max" value="2"/>
            <nvpair id="ma-ms-drbd-2_id" name="clone_node_max" value="1"/>
            <nvpair id="ma-ms-drbd-3_id" name="master_max" value="1"/>
            <nvpair id="ma-ms-drbd-4_id" name="master_node_max" value="1"/>
            <nvpair id="ma-ms-drbd-5_id" name="notify" value="yes"/>
            <nvpair id="ma-ms-drbd-6_id" name="globally_unique" value="false"/>
            <nvpair id="ma-ms-drbd-7_id" name="target_role" value="started"/>
          </attributes>
        </meta_attributes>
        <primitive id="drbd_id" class="ocf" provider="heartbeat" type="drbd">
          <operations>
            <op id="drbd-monitoring" interval="30s" name="monitor" timeout="15s"/>

You might want to monitor both the slave and the master here:

    <operations>
      <op id="op1" name="monitor" interval="5s" timeout="5s" role="Master"/>
      <op id="op2" name="monitor" interval="6s" timeout="5s" role="Slave"/>
    </operations>

Make sure you use different intervals, because multiple monitor operations with the same interval on one resource are not supported.
          </operations>
          <instance_attributes id="ia-drbd_id">
            <attributes>
              <nvpair id="drdb-resource_id" name="drbd_resource" value="r0"/>
            </attributes>
          </instance_attributes>
        </primitive>
      </master_slave>
      <primitive id="fs_id" class="ocf" provider="heartbeat" type="Filesystem">
        <operations>
          <op id="Filesystem_Monitoring" interval="10s" name="monitor" timeout="30s"/>
        </operations>
        <instance_attributes id="ia-fs_id">
          <attributes>
            <nvpair id="ia-fs-1_id" name="fstype" value="ext3"/>
            <nvpair id="ia-fs-2_id" name="directory" value="/export_www"/>
            <nvpair id="ia-fs-3_id" name="device" value="/dev/drbd1"/>
          </attributes>
        </instance_attributes>
      </primitive>
      <primitive id="ip_id" class="ocf" provider="heartbeat" type="IPaddr">
        <operations>
          <op id="ip-monitoring" interval="10s" name="monitor" timeout="30s"/>
        </operations>
        <instance_attributes id="ia-ip_id">
          <attributes>
            <nvpair id="ip_id" name="ip" value="132.206.178.80"/>
          </attributes>
        </instance_attributes>
      </primitive>
      <primitive id="mysql_id" class="ocf" provider="heartbeat" type="mysql">
        <operations>
          <op id="mysql-monitoring" interval="10s" name="monitor" timeout="30s"/>
        </operations>
      </primitive>
      <primitive id="apache_id" class="ocf" provider="heartbeat" type="apache">
        <operations>
          <op id="apache-monitoring" interval="10s" name="monitor" timeout="30s"/>
        </operations>
      </primitive>
      <primitive id="email_id" class="ocf"
Re: [Linux-HA] Demote primary when connectivity lost.
Guy wrote:

Hi guys, after much fiddling and learning (still loads to do though) I've got my 2-node primary/secondary - secondary/primary setup more or less working. On failure of node 1, node 2 takes both drbd partitions as primary and mounts the partitions and nfs etc. When node 1 is brought back up, I wait for node 2 to sync back over the old primary on node 1 and then use crm_resource to move the one primary back to node 1. Is there any way to do this automatically? Wait for drbd to sync and then push the primary back to node 1? This is just curiosity; doing it manually gives me a chance to see that all is well.

The problem I've really got is if one node just loses connectivity. I've played around with dopd and pingd, but this hasn't given the desired results. I have one interface into the network and one connecting the machines by crossover. If node 1 loses connectivity (with dopd running), it fences /dev/drbd0 on node 2, thus stopping node 2 from taking it as primary. What I really need it to do is force any primary partitions into secondary mode if the node loses connectivity, so that the live node can sync back to it after recovery of connectivity. I don't see that dopd can help me with this, so do I make some sort of constraint with pingd to demote the partitions if there's no connectivity?

Just set a score of -INFINITY for the master role when the pingd attribute is 0 or undefined. Something like

    <rsc_location id="my_resource:connected" rsc="my_resource">
      <rule role="master" id="my_resource:connected:rule" score="-INFINITY" boolean_op="or">
        <expression id="my_resource:connected:expr:undefined" attribute="pingd" operation="not_defined"/>
        <expression id="my_resource:connected:expr:zero" attribute="pingd" operation="lte" value="0"/>
      </rule>
    </rsc_location>

I have location constraints putting one primary partition on each node, so would I need to do something with the scoring to ensure that demoted partitions stayed that way until the resync by drbd was done?

That does not seem possible right now. I'd go with keeping the primary on the second node until you have manually verified drbd has synced, and then migrate manually.

I've attached my conf files. As you can see, the only constraints I currently have are the location preferences for the primary partitions and the colocation and order constraints to ensure the groups for the fs, nfs and ipaddr only start on the appropriate node.

Regards
Dominik
Re: [Linux-HA] (Bug?) regarding resource_stickiness, master_slave and master-colocated groups
Adrian Chapela wrote:
Dominik Klein escribió:

Hi, during the write-up of ScoreCalculation on the wiki, I noticed something strange. It'd be nice to know whether this is on purpose or a bug. The test setup is: 2 nodes, 1 drbd device, a group of 3 resources which are to run on top of the drbd master. resource_stickiness is set to 100. If I use a colocation constraint with a score of infinity, the master receives a stickiness bonus of 600. If I change the colocation score to a numeric value (I tested 1000 and 5000), the bonus is reduced to 400. I could explain the 600 as 2 * num_resources * stickiness, but I cannot see where those 400 come from. Is this a bug or (why?) is this intended?

I think there is a BUG related to master_slave resources. I have opened this bug:
http://developerbugs.linux-foundation.org/show_bug.cgi?id=1852
and today I have no response... :( Is this what you are talking about?

No, but I experienced that as well. I don't know why it happens, but I think you can get around it. Please try this:

    <!-- make clone instance :0 run on node1 -->
    <rsc_location id="rscloc-ms-drbd1:0" rsc="drbd1:0">
      <rule id="rscloc-ms-drbd1:0-rule1" score="500">
        <expression id="rscloc-ms-drbd1:0-rule1-expr" attribute="#uname" operation="eq" value="node1"/>
      </rule>
      <rule id="rscloc-ms-drbd1:0-rule2" score="-500">
        <expression id="rscloc-ms-drbd1:0-rule2-expr" attribute="#uname" operation="eq" value="node2"/>
      </rule>
    </rsc_location>

    <!-- make clone instance :1 run on node2 -->
    <rsc_location id="rscloc-ms-drbd1:1" rsc="drbd1:1">
      <rule id="rscloc-ms-drbd1:1-rule1" score="500">
        <expression id="rscloc-ms-drbd1:1-rule1-expr" attribute="#uname" operation="eq" value="node2"/>
      </rule>
      <rule id="rscloc-ms-drbd1:1-rule2" score="-500">
        <expression id="rscloc-ms-drbd1:1-rule2-expr" attribute="#uname" operation="eq" value="node1"/>
      </rule>
    </rsc_location>

Solved that problem for me.

Regards
Dominik
Re: [Linux-HA] (Bug?) regarding resource_stickiness, master_slave and master-colocated groups
Solved that problem for me.

At least with a colocated resource, I have to add.

Regards
Dominik
Re: [Linux-HA] (Bug?) regarding resource_stickiness, master_slave and master-colocated groups
Dominik Klein wrote:

Solved that problem for me.
At least with a colocated resource, I have to add.

Urghs. Friday afternoon... I just wanted to verify that, and it turns out my method does not restart the whole thing on a slave failure. That's true. But it does still restart the whole thing if I shut down the slave node and let it rejoin the cluster. In fact, what I see is that for a short time, BOTH nodes become standby. Right after that, both nodes are shown as online, and then the master_slave resource, including all clone instances and colocated resources, is restarted.

Regards
Dominik
Re: [Linux-HA] DRBD+Heartbeat not working as intended
Hi

I have a drbd+heartbeat setup and I am having a problem. If the machine on which drbd is master shuts down, the passive machine does not change its status to the active one, and because of that it can't mount the drbd file system. Can anyone give me some feedback on this? Here is my cib.xml:

    <cib admin_epoch="0" have_quorum="true" ignore_dtd="false" num_peers="2" cib_feature_revision="2.0" epoch="83" generated="true" ccm_transition="4" dc_uuid="56ec2257-b0e1-4395-8ca2-ff2f96151b55" num_updates="1" cib-last-written="Fri Feb 29 08:00:29 2008">
    <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <attributes>
          <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="2.1.3-node: 552305612591183b1628baa5bc6e903e0f1e26a3"/>
          <nvpair id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1204282824"/>
        </attributes>
      </cluster_property_set>
    </crm_config>
    <nodes>
      <node id="34a67e55-71b1-421f-96cc-519ef05b110b" uname="pgslave.blumar.com.br" type="normal">
        <instance_attributes id="nodes-34a67e55-71b1-421f-96cc-519ef05b110b">
          <attributes>
            <nvpair id="standby-34a67e55-71b1-421f-96cc-519ef05b110b" name="standby" value="off"/>
          </attributes>
        </instance_attributes>
      </node>
      <node id="56ec2257-b0e1-4395-8ca2-ff2f96151b55" uname="pgmaster.blumar.com.br" type="normal"/>
    </nodes>
    <resources>
      <master_slave id="ms-drbd0">
        <meta_attributes id="ma-ms-drbd0">
          <attributes>
            <nvpair id="ma-ms-drbd0-1" name="clone_max" value="2"/>
            <nvpair id="ma-ms-drbd0-2" name="clone_node_max" value="1"/>
            <nvpair id="ma-ms-drbd0-3" name="master_max" value="1"/>
            <nvpair id="ma-ms-drbd0-4" name="master_node_max" value="1"/>
            <nvpair id="ma-ms-drbd0-5" name="notify" value="yes"/>
            <nvpair id="ma-ms-drbd0-6" name="globally_unique" value="false"/>
            <nvpair id="ma-ms-drbd0-7" name="target_role" value="started"/>
          </attributes>
        </meta_attributes>
        <primitive id="drbd0" class="ocf" provider="heartbeat" type="drbd">
          <instance_attributes id="ia-drbd0">
            <attributes>
              <nvpair id="ia-drbd0-1" name="drbd_resource" value="repdata"/>
            </attributes>
          </instance_attributes>
          <meta_attributes id="drbd0:0_meta_attrs">
            <attributes/>
          </meta_attributes>
        </primitive>
      </master_slave>
      <group id="group_pgsql">
        <meta_attributes id="group_pgsql_meta_attrs">
          <attributes>
            <nvpair id="group_pgsql_metaattr_target_role" name="target_role" value="stopped"/>

Why "stopped" here?

          </attributes>
        </meta_attributes>
        <primitive id="resource_ip" class="ocf" type="IPaddr" provider="heartbeat">
          <instance_attributes id="resource_ip_instance_attrs">
            <attributes>
              <nvpair id="f294ba51-00f9-4a9b-80c0-43aa7944f474" name="ip" value="10.3.3.24"/>
            </attributes>
          </instance_attributes>
          <meta_attributes id="resource_ip_meta_attrs">
            <attributes>
              <nvpair id="resource_ip_metaattr_target_role" name="target_role" value="started"/>

But "started" here?

            </attributes>
          </meta_attributes>
        </primitive>
        <primitive id="resource_FS" class="ocf" type="Filesystem" provider="heartbeat">
          <instance_attributes id="resource_FS_instance_attrs">
            <attributes>
              <nvpair id="d1413b97-4944-4fee-9cba-b4ab3e71f83f" name="device" value="/dev/drbd0"/>
              <nvpair id="baad1fbb-3389-4778-8832-91d5720341a6" name="directory" value="/repdata"/>
              <nvpair id="35e6e951-ace4-4bbf-9d3e-5824219a9809" name="fstype" value="ext3"/>
            </attributes>
          </instance_attributes>
          <meta_attributes id="resource_FS_meta_attrs">
            <attributes>
              <nvpair id="resource_FS_metaattr_target_role" name="target_role" value="started"/>

And "started" here?

            </attributes>
          </meta_attributes>
        </primitive>
        <primitive id="resource_pgsql" class="ocf" type="pgsql" provider="heartbeat">
          <meta_attributes id="resource_pgsql_meta_attrs">
            <attributes>
              <nvpair id="resource_pgsql_metaattr_target_role" name="target_role" value="started"/>

And "started" here as well. That does not make sense. It is sufficient to set the target_role for the group. You do not have to set it for each resource.
/attributes /meta_attributes /primitive /group /resources constraints rsc_colocation id=colocation_ from=group_pgsql to=group_pgsql score=INFINITY/ This does not make any sense. rsc_location id=location_ rsc=group_pgsql rule id=prefered_location_ score=INFINITY expression attribute=#uname id=a7b8a885-27c2-4157-a597-42dd6cafba8c operation=eq value=pgmaster.blumar.com.br/ /rule
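A minimal sketch of the suggestion above, reusing the crm_resource syntax that appears elsewhere in this digest (the group id is taken from the posted cib.xml):

  # set target_role once on the group and drop the per-primitive copies
  crm_resource -r group_pgsql --meta -p target_role -v started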
Re: [Linux-HA] Enhanced version of showscores and a major update on the score calculation documentation
Hi Dominik, this looks good now. Thank you for fixing.

You're welcome.

One question: are you able to cache the default stickiness values? If you determine them in every loop, it costs time.

Good idea. Thanks.

The script runs here 5.7 seconds for 18 resources on the DC.

That's long. Regards Dominik

showscores.sh Description: Bourne shell script
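A hedged sketch of the caching idea: query crm_config once before the loop and parse the defaults from the saved output (variable names are illustrative; the grep pattern mirrors the one in showscores.sh):

  # fetch the cluster defaults a single time instead of once per resource
  crm_config=`cibadmin -Q -o crm_config 2>/dev/null`
  default_stickiness=`echo "$crm_config" | grep "default[_-]resource[_-]stickiness" | grep -o -E 'value ?= ?"[^ ]*"' | cut -d '"' -f 2`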
[Linux-HA] Enhanced version of showscores and a major update on the score calculation documentation
Hi, just yesterday I found a much better way to get the scores from ptest. This new version can not only display normal scores, it can also display master scores, which seems quite important these days. It still produces a heck of a lot of logs when executed, but that's just the nature of the commands I use; that's not going to change right now :( Now you can also sort the output by node or resource. Resource is the default; give "node" as a parameter to sort by node. Please test and report problems. Also, I put a major update on the ScoreCalculation page this morning to make things about scores clearer, especially concerning groups and master_slave resources. Post questions if there's something that's not clear yet. http://www.linux-ha.org/ScoreCalculation Regards Dominik

showscores.sh Description: Bourne shell script
Re: [Linux-HA] Which STONITH devices is everybody using?
Thanks for the replies so far. No one else? Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Enhanced version of showscores and a major update on the score calculation documentation
Andreas Mock wrote: Hi Dominik, you know I like your script, but the newest version broke something: when a resource name has '.' (dots) in it,

That may well be. Use dashes ;)

the way you split the $line in the while loop to get the score, node and resource name doesn't work any more.

Can you send me a line of relevant ptest output with the pattern "After" in it? I don't have such a line in my configuration.

I sent you a PM with the output from my test system.

Regards Dominik
Re: [Linux-HA] Enhanced version of showscores and a major update on the score calculation documentation
Use dashes ;)

Well - turns out it's called a hyphen. At least according to dict.leo.org :) Regards Dominik
Re: [Linux-HA] Strange behavior of the resource group on 2 nodes cluster
In a colocated group (which is the default), all subsequent resources are tied to the group's first resource with a score of INFINITY. To keep them from running on any node but the one the first resource runs on, they also get -INFINITY for every other node.

Thank you very much, Dominik, for your reply - but how can I then achieve the intended behavior: group failover on the third failure?

Although I cannot explain it score-wise, as you can only see INFINITY for the group resources, this should work. Just let a resource in the group fail a couple of times and see what happens. Works for me. I'll have Andrew explain this when he's back from Australia :)

Regards Dominik
Re: [Linux-HA] Strange behavior of the resource group on 2 nodes cluster
Hi Dominik, I tried to let the resource in the group fail a couple of times, but after the second try the failcount for both resources is NOT increased.

Did you wait for the cluster to restart the resource after you produced the failure before causing another failure?

It shows after each try (with ifconfig eth0:x down) still the same:

Resource             Score      Node   Stickin.  Failc.  Fail.-Stick.
IPaddr_193_27_40_57  0          dbora  2         0       -3
IPaddr_193_27_40_57  1          demo   2         1       -3
ubis_udbmain_13      -INFINITY  dbora  2         0       -3
ubis_udbmain_13      INFINITY   demo   2         1       -3

Regards Dominik
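To check whether a produced failure was actually recorded before triggering the next one, the failcount can be queried per node and resource (crm_failcount is used the same way elsewhere on this list):

  crm_failcount -G -r ubis_udbmain_13 -U demo   # read the current failcount
  crm_failcount -D -r ubis_udbmain_13 -U demo   # reset it if needed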
Re: AW: AW: AW: [Linux-HA] Switchover problem with DRBD
Schmidt, Florian wrote: Hello again, you're right, I do have DRBD 8.2.1 installed. Well, do you mean downgrading to 0.7.x would be better? This is only a test cluster, so this shouldn't be a problem. But I'll try re-installing my current DRBD version first and then (if this doesn't help) downgrading to 0.7.x.

I guess upgrading to 8.2.5 would be the way to go. Although I'm not sure the issue you had was fixed in that version, a downgrade to 0.7 should never be needed. I sometimes saw what you reported (no response from the drbd module) during my own tests, but I was not able to reproduce it. Sorry. Regards Dominik
Re: [Linux-HA] Strange behavior of the resource group on 2 nodes cluster
Nikita Michalko wrote: Hi all! I have some troubles with HA V2.1.3 on SLES10 SP1: a two-node cluster with one resource group of two resources. Intended is a forced failover of the group on the third failure of any resource in the group; one node is preferred over the other (see attached configuration). After start, the resources are running on the preferred node (demo) as expected, but with failcount 1 and the following scores (script showscores):

Resource             Score      Node   Stick.  Failcount  Fail.-Stickiness
IPaddr_193_27_40_57  0          dbora  2       0          -3
IPaddr_193_27_40_57  2          demo   2       0          -3
ubis_udbmain_13      -INFINITY  dbora  2       0          -3
ubis_udbmain_13      INFINITY   demo   2       1          -3

The score of the first resource (IPaddr_193_27_40_57) is 2 as expected (group resource_stickiness=1), but the second resource has score INFINITY - why? Because of the added colocation constraint for the group?

In a colocated group (which is the default), all subsequent resources are tied to the group's first resource with a score of INFINITY. To keep them from running on any node but the one the first resource runs on, they also get -INFINITY for every other node. Regards Dominik
[Linux-HA] Which STONITH devices is everybody using?
Hi I read the list of stonith plugins and had a look at which devices they support. The list of devices you can buy new today was rather small. So I'd like to know which STONITH devices heartbeat users use. Which device do you use? What kind of device is it? How much is it? How well does it work? What problems did you encounter? Did you have to write the stonith plugin yourself? If so, did you contribute it to the project? As I'm not much of a programmer and have to buy some stonith hardware within the next weeks, help on this would be much appreciated. Regards Dominik I'll even start with a reply myself: APC AP7920 power distribution unit. Cost about 450 Euros a piece. It seems to work well, do not have it in production yet. I set it up with apcmastersnmp. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Force switch with DRBD
DucaConte Balabam wrote: Hello, I have a cluster using heartbeat v2 and drbd in master/slave configuration. It's:

Last updated: Tue Mar 4 09:49:30 2008
Current DC: rman1c (875afc12-b88e-4940-9816-218d2a5911c3)
2 Nodes configured.
2 Resources configured.

Node: rman1a (4d7bd4ec-c121-4b13-a2d4-aec820ea36d5): online
Node: rman1c (875afc12-b88e-4940-9816-218d2a5911c3): online

Master/Slave Set: ms-drbd0
    drbd0:0 (heartbeat::ocf:drbd): Master rman1a
    drbd0:1 (heartbeat::ocf:drbd): Started rman1c
Resource Group: Oracle
    FS (heartbeat::ocf:Filesystem): Started rman1a
    V_IP (heartbeat::ocf:IPaddr2): Started rman1a
    Ora_DB (heartbeat::ocf:oracle): Started rman1a
    Ora_LSNR (heartbeat::ocf:oralsnr): Started rman1a

How can I force all resources to move to the other node? Is there a command?

Try: crm_resource -M -r Oracle

This creates a rsc_location constraint that does not allow Oracle to run in its current location; therefore, the cluster will migrate it. Don't forget crm_resource -U -r Oracle afterwards to remove that rsc_location constraint (and allow the resource to run on the first node again). If you do not do this, the resource will never be able to run on the first node again. Depending on your configuration, the resource might be migrated back to the first node after the rsc_location constraint has been removed. If so, you should read http://wiki.linux-ha.org/ScoreCalculation and set resource_stickiness to a reasonable value. Regards Dominik
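For reference, a hedged way to inspect the temporary constraint that crm_resource -M creates (its id is generated by the tool and varies between versions):

  cibadmin -Q -o constraints | grep -i oracle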
Re: [Linux-HA] How to allow resources to ping-pong forever?
Alex Spengler wrote: Hi, I'm stuck in setting up my cluster. What I want to achive is - run apache on whatever node together with cluster IP which is 172.23.100.200. - if apache fails - switch over to other node - if gateway 172.23.100.1 is not reachable - switch over to other node AND allow unlimited number of switchovers! This is the problem, it does switch over but only once or twice and then it's stuck .. any ideas? *unlimited* is pretty much impossible I think. But you may start with this: Set a #uname score of 5050 for one node, 5000 for the other node. Set resource_failure_stickiness to -100. Use a multiplier of 100 for pingd and pingd as a score_attribute (sth like: rsc_location id=rsc-loc-syslog rsc=syslog rule id=syslog-connected score_attribute=pingd expression id=syslog-connected-rule-1 attribute=pingd operation=defined/ /rule /rsc_location ) Then you will end up with: Startup node1: 5050 + 100 pingd node2: 5000 + 100 pingd decision: start on node1 now, whatever fails will reduce the score by 100, causing a failover. With a start at 5000, you can have 50 failovers. Enlarge as needed. Regards Dominik thanks in advance Alex *Here my config:* *Node1 - big-sSTATSfe1* eth0: 172.23.100.26 eth1: 192.168.0.1 *Node2 - big-sSTATSfe2* eth0: 172.23.100.22 eth1:192.168.0.2 *ha.cf:* use_logd yes node big-sSTATSfe1 big-sSTATSfe2 deadtime 5 deadping 5 initdead 60 warntime 3 crm true ucast eth0 172.23.100.22 # 172.23.100.26 on the second node ucast eth1 192.168.0.2# 192.168.0.1 on the second node ping 172.23.100.1 *cib.xml* cib admin_epoch=1 epoch=1 num_updates=1 generated=true have_quorum=true ignore_dtd=false num_peers=2 dc_uuid=68cd29ed-c7fe-44d9-9fe8-a1258e5b1d0f configuration crm_config cluster_property_set id=cib-bootstrap-options attributes nvpair id=cib-bootstrap-options-symmetric-cluster name=symmetric-cluster value=true/ nvpair id=cib-bootstrap-options-no-quorum-policy name=no-quorum-policy value=ignore/ nvpair id=cib-bootstrap-options-default-resource-stickiness name=default-resource-stickiness value=0/ nvpair id=cib-bootstrap-options-default-resource-failure-stickiness name=default-resource-failure-stickiness value=-100/ nvpair id=cib-bootstrap-options-stonith-enabled name=stonith-enabled value=true/ nvpair id=cib-bootstrap-options-stonith-action name=stonith-action value=reboot/ nvpair id=cib-bootstrap-options-remove-after-stop name=remove-after-stop value=false/ nvpair id=cib-bootstrap-options-short-resource-names name=short-resource-names value=true/ nvpair id=cib-bootstrap-options-transition-idle-timeout name=transition-idle-timeout value=1min/ nvpair id=cib-bootstrap-options-default-action-timeout name=default-action-timeout value=10s/ nvpair id=cib-bootstrap-options-is-managed-default name=is-managed-default value=true/ /attributes /cluster_property_set /crm_config nodes/ resources group id=apache_group_p80 ordered=true collocated=true primitive class=ocf provider=heartbeat type=IPaddr id=IPaddr_p80 instance_attributes id=IPaddr_1_inst_attr attributes nvpair id=IPaddr_p80_attr_0 name=ip value=172.23.100.200/ nvpair id=IPaddr_p80_attr_1 name=netmask value=255.255.255.0/ nvpair id=IPaddr_p80_attr_2 name=nic value=eth0/ nvpair id=IPaddr_p80_attr_3 name=broadcast value=172.23.100.255 / /attributes /instance_attributes operations op id=IPaddr_p80_mon name=monitor interval=2s timeout=3s/ /operations /primitive primitive id=apache_p80 class=lsb type=apache provider=heartbeat instance_attributes id=inatt_apache_p80 attributes nvpair name=configfile value=/etc/httpd/conf/httpd.conf 
id=nvpb1_apache_p80/ nvpair name=statusurl value= http://172.23.100.200:80/server-status http://172.23.100.200/server-status id=nvpb2_apache_p80/ /attributes /instance_attributes operations op id=apache_p80:start name=start timeout=10s/ op id=apache_p80:stop name=stop timeout=10s/ op id=apache_p80:monitor name=monitor interval=2s timeout=5s/ /operations /primitive /group clone id=pingd instance_attributes id=pingd attributes nvpair id=pingd-clone_max name=clone_max value=2/ nvpair id=pingd-clone_node_max name=clone_node_max value=1/ /attributes /instance_attributes primitive id=gateway class=ocf type=pingd provider=heartbeat operations op id=gateway:child-monitor name=monitor interval=5s timeout=5s prereq=nothing/ op id=gateway:child-start name=start prereq=nothing/ /operations instance_attributes id=pingd_inst_attrs attributes nvpair id=pingd-dampen name=dampen value=5s/ nvpair id=pingd-multiplier name=multiplier value=100/ /attributes /instance_attributes /primitive /clone
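A worked sketch of the arithmetic behind the suggestion above, with one ping node, multiplier 100 and failure stickiness -100 plugged in (illustrative numbers, not taken from the posted config):

  startup:      node1 = 5050 + 100 = 5150   node2 = 5000 + 100 = 5100   -> run on node1
  1st failure:  node1 = 5150 - 100 = 5050 < 5100                        -> fail over to node2
  2nd failure:  node2 = 5100 - 100 = 5000 < 5050                        -> fail over to node1
  Each failure costs another 100 points on that node, which is where the roughly 50 switchovers from a 5000-point start come from.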
Re: [Linux-HA] (no subject)
Dominik Klein wrote: Schmidt, Florian wrote: Hello list, I still have a problem with my heartbeat config. I want heartbeat to start AFD. I checked the RA for LSB compatibility and think that it's right now. The log file says that bash does not find the command afd.

crmd[2725]: 2008/02/29_11:32:00 info: do_lrm_rsc_op: Performing op=AFD_start_0 key=5:1:71d7b119-77c3-4e6b-9ac6-6acf20d8cf61)
lrmd[2722]: 2008/02/29_11:32:01 info: rsc:AFD: start
lrmd[2785]: 2008/02/29_11:32:01 WARN: For LSB init script, no additional parameters are needed.
lrmd[2722]: 2008/02/29_11:32:01 info: RA output: (AFD:start:stdout) Starting AFD for afdha :
lrmd[2722]: 2008/02/29_11:32:01 info: RA output: (AFD:start:stderr) bash: afd: command not found
tengine[2773]: 2008/02/29_11:32:01 info: extract_event: Aborting on transient_attributes changes for 44425bd9-2cba-4d6a-ac62-82a8bb81a23d
lrmd[2722]: 2008/02/29_11:32:01 info: RA output: (AFD:start:stdout) Failed
tengine[2773]: 2008/02/29_11:32:01 info: update_abort_priority: Abort priority upgraded to 100
crmd[2725]: 2008/02/29_11:32:01 ERROR: process_lrm_event: LRM operation AFD_start_0 (call=3, rc=1) Error unknown error
tengine[2773]: 2008/02/29_11:32:01 info: update_abort_priority: Abort action 0 superceeded by 2
tengine[2773]: 2008/02/29_11:32:01 WARN: status_from_rc: Action start on noderz failed (target: null vs. rc: 1): Error
tengine[2773]: 2008/02/29_11:32:01 WARN: update_failcount: Updating failcount for AFD on 91d062c3-ad0a-4c24-b759-acada7f19101 after failed start: rc=1

There is an extra user for AFD, called afdha. The afdha script is attached. Switching to afdha via su afdha, I can start AFD, so I don't understand where the problem is.

( su $afduser -c "afd -a" )

Try

su - $afduser -c "afd -a"
---^

This will make su log in as the user (i.e. execute its profile). Right now, you're running this as root, in the root environment. If afd is not in root's PATH, then it cannot work.

To add some more possibilities: you could also use export PATH=$PATH:whatever at the start of the script, so you have afd in your PATH. Or use /path/to/afd instead of afd.

Regards Dominik
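The two suggestions above, combined into a minimal LSB-script sketch (the afd install path is an assumption; adjust it to the real location):

  #!/bin/sh
  # make afd resolvable for the non-login shell the cluster uses
  PATH=$PATH:/opt/afd/bin
  export PATH
  # alternatively, run it through a login shell as the afd user
  su - afdha -c "afd -a"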
Re: [Linux-HA] (no subject)
Schmidt, Florian wrote: Hello list, I still have problem with my heartbeat-config I want heartbeat to start AFD. I checked the RA for LSB-compatibility and think that it's right now. The log file says, that the bash does not find the command afd. crmd[2725]: 2008/02/29_11:32:00 info: do_lrm_rsc_op: Performing op=AFD_start_0 key=5:1:71d7b119-77c3-4e6b-9ac6-6acf20d8cf61) lrmd[2722]: 2008/02/29_11:32:01 info: rsc:AFD: start lrmd[2785]: 2008/02/29_11:32:01 WARN: For LSB init script, no additional parameters are needed. lrmd[2722]: 2008/02/29_11:32:01 info: RA output: (AFD:start:stdout) Starting AFD for afdha : lrmd[2722]: 2008/02/29_11:32:01 info: RA output: (AFD:start:stderr) bash: afd: command not found tengine[2773]: 2008/02/29_11:32:01 info: extract_event: Aborting on transient_attributes changes for 44425bd9-2cba-4d6a-ac62-82a8bb81a23d lrmd[2722]: 2008/02/29_11:32:01 info: RA output: (AFD:start:stdout) Failed tengine[2773]: 2008/02/29_11:32:01 info: update_abort_priority: Abort priority upgraded to 100 crmd[2725]: 2008/02/29_11:32:01 ERROR: process_lrm_event: LRM operation AFD_start_0 (call=3, rc=1) Error unknown error tengine[2773]: 2008/02/29_11:32:01 info: update_abort_priority: Abort action 0 superceeded by 2 tengine[2773]: 2008/02/29_11:32:01 WARN: status_from_rc: Action start on noderz failed (target: null vs. rc: 1): Error tengine[2773]: 2008/02/29_11:32:01 WARN: update_failcount: Updating failcount for AFD on 91d062c3-ad0a-4c24-b759-acada7f19101 after failed start: rc=1 There is an extra user for AFD, called afdha. The afdha-script is attached. Switching to to afdha via su afdha I can start AFD, so I don't understand, where's the problem. ( su $afduser -c afd -a Try su - $afduser -c afd -a ---^ This will make su login as the user (i.e. execute its profile). Right now, you're running this as root, in the root-environment. If afd is not in root's PATH, then it cannot work. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: AW: [Linux-HA] (no subject)
Hello, I wrote the following lines on top of the script code: export PATH=$PATH:/home/afdha export AFD_WORK_DIR=/usr/afd AFD needs one environment-variable named AFD_WORK_DIR to know, where to work I also did chown afdha:501 /usr/afd and chown /home/afdha How far does this work, because the heartbeat still does not start AFD, but logs errors: crmd[3541]: 2008/02/29_13:58:26 info: do_lrm_rsc_op: Performing op=AFD_start_0 key=4:1:151fabd8-d44a-40b1-b946-38480dbd8c8f) lrmd[3538]: 2008/02/29_13:58:26 info: rsc:AFD: start lrmd[3583]: 2008/02/29_13:58:26 WARN: For LSB init script, no additional parameters are needed. lrmd[3538]: 2008/02/29_13:58:26 info: RA output: (AFD:start:stdout) Starting AFD for afdha : lrmd[3538]: 2008/02/29_13:58:26 info: RA output: (AFD:start:stderr) ERROR : Failed to determine AFD working directory! lrmd[3538]: 2008/02/29_13:58:26 info: RA output: (AFD:start:stderr) No option -w or environment variable AFD_WORK_DIR set. This read as if you could start afd -w /usr/afd ... to set the work dir. lrmd[3538]: 2008/02/29_13:58:26 info: RA output: (AFD:start:stdout) Failed. crmd[3541]: 2008/02/29_13:58:26 info: process_lrm_event: LRM operation AFD_start_0 (call=3, rc=0) complete tengine[3556]: 2008/02/29_13:58:26 info: match_graph_event: Action AFD_start_0 (4) confirmed on noderz (rc=0) tengine[3556]: 2008/02/29_13:58:26 info: send_rsc_command: Initiating action 5: AFD_monitor_3 on noderz crmd[3541]: 2008/02/29_13:58:26 info: do_lrm_rsc_op: Performing op=AFD_monitor_3 key=5:1:151fabd8-d44a-40b1-b946-38480dbd8c8f) lrmd[3538]: 2008/02/29_13:58:27 info: RA output: (AFD:monitor:stderr) ERROR : Failed to determine AFD working directory! No option -w or environment variable AFD_WORK_DIR set. How can I set this environment-variable AFD_WORK_DIR globally I have it in ~/.bash_profile, but this doesn't work for the afdha-user :-( I'm working with linux only since 4 or 5 weeks (since I started building that heartbeat/drbd-cluster)...so my skills are not that good, but I try to improve ;) Thanks Florian ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
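One hedged way to set the variable "globally" is a profile snippet, which most distributions source for login shells. Note that this does not reach the init script started by lrmd (which runs without a login environment), so the export at the top of the script remains necessary:

  # /etc/profile.d/afd.sh - sourced by login shells on most distributions
  export AFD_WORK_DIR=/usr/afd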
Re: [Linux-HA] Script to calculate scores to allow a defined number of resource failures before failover
Some cosmetic changes. Thanks to wschlich. Regards Dominik calc_linux_ha_scores.sh Description: Bourne shell script ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Question of Service Monitoring in HAv2
I had a query about service monitoring in HA v2. I was wondering if I can configure it in such a way that if a service fails, heartbeat tries to restart it, say, n times before it fences the system.

It will (by default) only fence the system if stop fails. If monitor fails, it will only try to restart the resource on the best node. Regards Dominik
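For completeness, a hedged sketch of how the reaction to a failed operation can be tuned per op in the CIB (the on_fail attribute accepts values such as restart, block or fence; the id and timings are illustrative):

  <op id="myres-monitor" name="monitor" interval="30s" timeout="20s" on_fail="restart"/>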
Re: [Linux-HA] Node Priority
When heartbeat is started (on both nodes), my node called pgslave gets promoted to DC, and that cannot happen; my node pgmaster should always be the active part of the service, and this node also needs to try to get promoted to DC whenever it has the chance (pgslave has to spend minimal time being DC, only in case of pgmaster failing). Can you guys give me some advice on how to solve this?

Start pgmaster alone, wait until it becomes DC, then start pgslave. Other than that, you cannot do anything about which node is the DC. Regards Dominik
Re: AW: AW: [Linux-HA] (no subject)
Schmidt, Florian wrote: Hello, I wrote the following lines on top of the script code:

export PATH=$PATH:/home/afdha
export AFD_WORK_DIR=/usr/afd

(AFD needs an environment variable named AFD_WORK_DIR to know where to work.) I also did chown afdha:501 /usr/afd and a chown on /home/afdha. This only goes so far, because heartbeat still does not start AFD, but logs errors:

crmd[3541]: 2008/02/29_13:58:26 info: do_lrm_rsc_op: Performing op=AFD_start_0 key=4:1:151fabd8-d44a-40b1-b946-38480dbd8c8f)
lrmd[3538]: 2008/02/29_13:58:26 info: rsc:AFD: start
lrmd[3583]: 2008/02/29_13:58:26 WARN: For LSB init script, no additional parameters are needed.
lrmd[3538]: 2008/02/29_13:58:26 info: RA output: (AFD:start:stdout) Starting AFD for afdha :
lrmd[3538]: 2008/02/29_13:58:26 info: RA output: (AFD:start:stderr) ERROR : Failed to determine AFD working directory!
lrmd[3538]: 2008/02/29_13:58:26 info: RA output: (AFD:start:stderr) No option -w or environment variable AFD_WORK_DIR set.

This reads as if you could start afd -w /usr/afd ... to set the work dir.

I'm just trying this... But is there no possibility to set this variable globally? So well, thanks... THIS works... but now the next errors occurred -.- I once again tried this with su - afdha /etc/init.d/afdha start:

Starting AFD for afdha :
Password:
su: incorrect password
Failed.
touch: cannot touch `/var/lock/subsys/afd': Permission denied

I don't know what password it wants... The other problem is the denied permission on /var/lock/subsys/afd. How can I permit the afdha user to create this file there? I can create it by hand and chown it to him, but he will remove it at the next stop :(

Join IRC #linux-ha :)
Re: [Linux-HA] Awesome explanation of stickiness scores :)
Fajar Priyanto wrote: Hi all, This afternoon I would have been asked a question about resource| failure_stickiness, because I was a bit confused the practical use of those stickiness in relation with score in location constrains. But, this page has explained it all very clearly: http://www.linux-ha.org/ScoreCalculation Excellent :) Hopefully this helps other as it goes into the list's archives. I'll consider this a compliment :) Thanks! Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Deleting Master/Slave-Resources
Schmidt, Florian wrote: Hi list, I'm not able to delete my DRBD master/slave set. I tried with crm_resource -D -r drbd_master_slave -t clone and crm_resource -D -r drbd_master_slave -t master-slave. drbd_master_slave is the name of my resource. Anyone have short advice?

cibadmin -Q -o resources
(copy your resource's XML to the clipboard)
cibadmin -D -X '<paste>'
and hit enter. Regards Dominik
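Spelled out with the resource id from the question (cibadmin matches the object to delete by its tag and id, so the children need not be pasted):

  cibadmin -Q -o resources
  cibadmin -D -X '<master_slave id="drbd_master_slave"/>'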
Re: [Linux-HA] Newest version of 'showscores'
Thank you for the script and for pointing me in the right direction. May I change the format of the output? (Yes, I also saw the thing with the wrong headings.)

Here's a newer version. It can now read resource-stickiness and resource_stickiness (notice the - and _). Both are possible, but up to now only one was looked for. Also fixed a problem with the headings being mixed up. I know this thing produces a lot of logs, but at least it does display the scores, huh? :)

#!/bin/bash
# Feb 2008, Dominik Klein
# Display scores of Linux-HA resources

# Known issues:
# * cannot get resource[_failure]_stickiness values for master/slave and clone resources
#   if those values are configured as meta attributes of the master/slave or clone resource
#   instead of as meta attributes of the encapsulated primitive

if [ `crmadmin -D | cut -d' ' -f4` != `uname -n|tr "[:upper:]" "[:lower:]"` ]
then
        echo "Warning: Script running not on DC. Might be slow(!)"
fi

# Heading
printf "%-16s%-16s%-16s%-16s%-16s%-16s\n" Resource Score Node Stickiness Failcount Failure-Stickiness

2>&1 ptest -LVVV | grep -E "assign_node|rsc_location" | grep -w -E "\ [-]{0,1}[0-9]*$" | while read line
do
        node=`echo $line|cut -d ' ' -f 8|cut -d ':' -f 1`
        res=`echo $line|cut -d ' ' -f 6|tr -d ,`
        score=`echo $line|cut -d ' ' -f 9|sed 's/1000000/INFINITY/g'`
        # get meta attribute resource_stickiness
        if crm_resource -g resource_stickiness -r $res --meta >/dev/null
        then
                stickiness=`crm_resource -g resource_stickiness -r $res --meta 2>/dev/null`
        else
                if crm_resource -g resource-stickiness -r $res --meta >/dev/null
                then
                        stickiness=`crm_resource -g resource-stickiness -r $res --meta 2>/dev/null`
                else
                        # if that doesn't exist, get syntax like <primitive ... resource-stickiness="100">
                        if ! stickiness=`crm_resource -x -r $res 2>/dev/null | grep -E "master|primitive|clone" | grep -o "resource[_-]stickiness=\"[0-9]*\"" | cut -d '"' -f 2 | grep -v ^$`
                        then
                                # if no resource-specific stickiness is configured, grep the default value
                                stickiness=`cibadmin -Q -o crm_config 2>/dev/null|grep "default[_-]resource[_-]stickiness"|grep -o -E 'value ?= ?"[^ ]*"'|cut -d '"' -f 2|grep -v ^$`
                        fi
                fi
        fi
        # get meta attribute resource_failure_stickiness
        if crm_resource -g resource_failure_stickiness -r $res --meta >/dev/null
        then
                failurestickiness=`crm_resource -g resource_failure_stickiness -r $res --meta 2>/dev/null`
        else
                if crm_resource -g resource-failure-stickiness -r $res --meta >/dev/null
                then
                        failurestickiness=`crm_resource -g resource-failure-stickiness -r $res --meta 2>/dev/null`
                else
                        # if that doesn't exist, get the default value
                        failurestickiness=`cibadmin -Q -o crm_config 2>/dev/null|grep "resource[_-]failure[_-]stickiness"|grep -o -E 'value ?= ?"[^ ]*"'|cut -d '"' -f 2|grep -v ^$`
                fi
        fi
        failcount=`crm_failcount -G -r $res -U $node 2>/dev/null|grep -o -E 'value ?= ?[0-9]*'|cut -d '=' -f 2|grep -v ^$`
        printf "%-16s%-16s%-16s%-16s%-16s%-16s\n" $res $score $node $stickiness $failcount $failurestickiness
done|sort -k 1
Re: [Linux-HA] linux-ha with drbd -- nothing working
Adam Kaufman wrote: hi all, I've been trying for the last few days to get heartbeat working with a basic drbd configuration. I was initially using heartbeat 2.0.8, but eventually upgraded to 2.1.3. the symptoms exhibited by each version of heartbeat were completely different, so here I'll focus on 2.1.3, as that's what I'd prefer to move forward with. symptoms: upon starting heartbeat, a monitor command is sent through the drbd resource, which returns OCF_NOT_RUNNING the drbd resource(s) is not registered with crm_mon, and cannot be transitioned into a started state configuration: ===cib.xml=== ?xml version=1.0 ? cib admin_epoch=17 epoch=1 have_quorum=true num_updates=0 configuration crm_config cluster_property_set id=test_cluster attributes nvpair id=id-symmetric-cluster name=symmetric-cluster value=True/ nvpair id=id-stickiness name=default-resource-stickiness value=INFINITY/ /attributes /cluster_property_set /crm_config nodes/ resources primitive class=ocf id=ip_resource provider=heartbeat type=IPaddr resource_stickiness=INFINITY You should define resource_stickiness as a meta_attribute. It will work like this, but it is not the best way. instance_attributes attributes nvpair name=ip value=10.107.10.20/ nvpair name=netmask value=22/ nvpair name=nic value=eth0/ /attributes /instance_attributes /primitive master_slave id=ms-drbd0 meta_attributes id=ma-ms-drbd0 attributes nvpair id=ma-ms-drbd0-1 name=clone_max value=2/ nvpair id=ma-ms-drbd0-2 name=clone_node_max value=1/ nvpair id=ma-ms-drbd0-3 name=master_max value=1/ nvpair id=ma-ms-drbd0-4 name=master_node_max value=1/ nvpair id=ma-ms-drbd0-5 name=notify value=yes/ nvpair id=ma-ms-drbd0-6 name=globally_unique value=false/ nvpair id=ma-ms-drbd0-7 name=target_role value=stopped/ You don't want the resource to be started. That's why the score is -INFINITY. 
/attributes /meta_attributes primitive class=ocf id=drbd0 provider=heartbeat type=drbd resource_stickiness=INFINITY see above instance_attributes id=ia-drbd0 attributes nvpair id=ia-drbd0-1 name=drbd_resource value=drbd0/ /attributes /instance_attributes /primitive /master_slave /resources constraints rsc_location id=run_ip_resource rsc=ip_resource rule id=pref_run_ip_resource score=100 expression attribute=#uname operation=eq value=plat-pc-18/ /rule /rsc_location /constraints /configuration status/ /cib ===drbd.conf=== resource drbd0 { protocol C; incon-degr-cmd echo 'incon-degr-cmd'; startup { degr-wfc-timeout 120; } disk { on-io-error pass_on; } net { on-disconnect reconnect; } syncer { rate 100M; group 1; al-extents 257; } # Begin chassis configuration on plat-pc-18 { device /dev/drbd0; disk /dev/mapper/nrStorage-storage; address10.107.10.38:7789; meta-disk internal; } on plat-pc-17 { device /dev/drbd0; disk /dev/mapper/nrStorage-storage; address10.107.10.37:7789; meta-disk internal; } } debug output: ===crm_verify -L=== crm_verify[7494]: 2008/02/28_17:15:14 info: main: =#=#=#=#= Getting XML =#=#=#=#= crm_verify[7494]: 2008/02/28_17:15:14 info: main: Reading XML from: live cluster crm_verify[7494]: 2008/02/28_17:15:14 notice: main: Required feature set: 2.0 crm_verify[7494]: 2008/02/28_17:15:14 debug: cluster_option: Using default value 'stop' for cluster option 'no-quorum-policy' crm_verify[7494]: 2008/02/28_17:15:14 debug: cluster_option: Using default value 'false' for cluster option 'stonith-enabled' crm_verify[7494]: 2008/02/28_17:15:14 debug: cluster_option: Using default value 'reboot' for cluster option 'stonith-action' crm_verify[7494]: 2008/02/28_17:15:14 debug: cluster_option: Using default value '0' for cluster option 'default-resource-failure-stickiness' crm_verify[7494]: 2008/02/28_17:15:14 debug: cluster_option: Using default value 'true' for cluster option 'is-managed-efault' crm_verify[7494]: 2008/02/28_17:15:14 debug: cluster_option: Using default value '60s' for cluster option 'cluster-delay' crm_verify[7494]: 2008/02/28_17:15:14 debug: cluster_option: Using default value '30' for cluster option 'batch-limit' crm_verify[7494]: 2008/02/28_17:15:14 debug: cluster_option: Using default value '20s' for cluster option 'default-action-imeout' crm_verify[7494]: 2008/02/28_17:15:14 debug: cluster_option: Using default value 'true' for cluster option 'stop-orphan-resources' crm_verify[7494]: 2008/02/28_17:15:14 debug: cluster_option: Using default value 'true' for cluster option 'stop-orphan-actions' crm_verify[7494]:
Re: [Linux-HA] Newest version of 'showscores'
Serge Dubrouski wrote: On Thu, Feb 28, 2008 at 9:53 AM, Dejan Muhamedagic [EMAIL PROTECTED] wrote: Hi, On Thu, Feb 28, 2008 at 03:47:04PM +0100, Dominik Klein wrote: thanky you for the script and for pointing to the right direction. May I change the format of the output? (Yes, I also saw the thing with wrong headings) Here's a newer version. It can now read resource-stickiness and resource_stickiness (notice the - and _). Both is possible, but up to now, only one was looked for. Also fixed a problem with the headings being mixed up. I was wondering if you could do it the other way around, i.e. one gives failover requirements such as after third failure move to the other node and the script calculates the various stickiness values. How about that? And I was wondering why it can't be done on CRM level? It would be great to be able to define a max number of allowed failures and let CRM to calculate all necessary scores. Full ACK. That would be nice. I guess I could script sth like that. But imho: If you read the ScoreCalculation page on linux-ha.org, calculating the necessary values for resource_failure_stickiness should be possible for someone who is to administer a cluster. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Clonesets and Resource Groups
Michael is right. Wörz that is :) Didn't see you both had the same first name. Sorry. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Clonesets and Resource Groups
The biggest issue so far is to migrate constrained resources from one node to another with a single command. I cannot use grouped resources because one of the resources must be a cloneset (ocfs) and thus cannot be a member of a group.

Why not? You just cannot create this in the GUI. Use the CLI.

Michael is right. beekhof 02/21/2007 10:16 PM: groups cannot contain anything except primitive resources

Regards Dominik
Re: [Linux-HA] how to create meta-data?
1.Iam using DRBD-0.7.21 , how to create meta-data for this system and how to upgrade it to DRBD-8 meta-data? http://blogs.linbit.com/florian/2007/10/03/step-by-step-upgrade-from-drbd-07-to-drbd-8/ 2.My meta-disk option in /etc/drbd.conf file has /dev/hda6 as entry.Is this same as internal meta-data? Sounds more like external meta-data. Post your config file. I don't want to be rude, but did you read any drbd documentation yet? Did you set up the system yourself? Do you ever use google? Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] How to trigger stonith of node
is there a way to trigger the stonith of a node for testing? pkill -9 heartbeat ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Newest version of 'showscores'
Andreas Mock wrote: Hi all, can someone point me to the newest version of 'showscores'. This should be the newest version. http://hg.clusterlabs.org/pacemaker/dev/rev/86e1f081dc7f Apparently it has a mix-up in the headings, but that shouldn't hurt. Someone posted it here once. Yup, that was me :) Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] how to create meta-data?
What does meta-data in DRBD mean http://www.drbd.org/users-guide/ch-internals.html#s-metadata and how to create meta-data. http://www.drbd.org/users-guide/s-first-time-up.html Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] drbd heartbeat v2
Damon Estep wrote: On this page: http://www.linux-ha.org/DRBD/HowTov2 is this comment: drbd must not be started by init Well, you do not have to start drbd by init. But it shouldn't harm if you do. This statement is false if you want to use the heartbeat Resource Agent drbddisk, but that's not what's described in the article. I have tried both 0.7.25 and 8.0.10 for DRBD on heartbeat 2.1.3 with crm=yes and everything set up as outlined on the page. The only reliable combination I can create is drbd 0.7.25, heartbeat 2.1.3, and drbd started by init. What kind of errors do you get? Post config and logs please. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] stonith on an apcmaster
The stonith daemons start successfully now, but with a monitor interval of 15s one of the two fails fairly quickly. The APC (9211 MasterSwitch) only allows a single login, and I wonder if the two daemons aren't colliding, and one is timing out and giving up.

Did you try apcmastersnmp? I don't know whether you have to change that for your particular device as well, though.
Re: [Linux-HA] drbd heartbeat v2 working (problem with fs0)
Marco Leone wrote: Hi, I'm using drbd 8.2.4 and heartbeat v2 too, on two Ubuntu 7.04 server nodes.

I guess you did not completely do that.

I followed this link: http://linux-ha.org/DRBD/HowTov2

<constraints>
 <rsc_location id="rsc_location_group_1" rsc="group_1">
  <rule id="prefered_location_group_1" score="100">
   <expression attribute="#uname" id="prefered_location_group_1_expr" operation="eq" value="ub704ha01"/>
  </rule>
 </rsc_location>
 <rsc_order id="drbd0_before_fs0" from="fs0" action="start" to="ms-drbd0" to_action="promote"/>
</constraints>
</configuration>
</cib>

Otherwise you'd have the master colocation constraint:

<rsc_colocation id="fs0_on_drbd0" to="ms-drbd0" to_role="master" from="fs0" score="infinity"/>

This should do what you want. Regards Dominik
Re: [Linux-HA] drbd heartbeat v2
crm_verify[19814]: 2008/02/19_08:46:57 WARN: unpack_rsc_op: Processing failed op drbd0:1_start_0 on cn2-inverness-co: Error crm_verify[19814]: 2008/02/19_08:46:57 WARN: unpack_rsc_op: Compatability handling for failed op drbd0:1_start_0 on cn2-inverness-co crm_verify[19814]: 2008/02/19_08:46:57 WARN: native_color: Resource drbd0:1 cannot run anywhere Warnings found during check: config may not be valid As you can see here, and as in after.txt - start on the drbd resource on cn2-inverness-co (which I assume is the host you powered off) failed. Check the logs on that. You have a timestamp so that should be fairly easy. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] DRBD 8.0 under Debian Etch?
Short question: Does anyone here have DRBD8 running with heartbeat under Etch? Short answer: Yes. Version 8.0.8, upgrading to 8.0.9 within the next days. I use the OCF RA to manage drbd as a Master/Slave Resource. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] send_arp cisco Pix v7.2 = arp table not update
Thanks for your advice. Unfortunately, that command is not included in the PIX :-(( ... I'm still stuck at that point and I must confess that I have no clue at all...

You could also modify the RA and have it set a virtual MAC address on the interface. Be sure to set the original MAC address back when the VIP leaves that machine, though. Regards Dominik
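A hedged sketch of the virtual MAC idea (the locally administered address is a placeholder; the RA's stop action would also have to restore the original MAC, and taking the link down briefly interrupts traffic):

  # on takeover: give the NIC a portable MAC the PIX already knows
  ip link set dev eth0 down
  ip link set dev eth0 address 02:00:00:00:00:10
  ip link set dev eth0 up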
[Linux-HA] Re: DRBD with monitor Operations won't start - as soon as I delete the operations, it starts immediately
<operations>
 <op id="op-ms-drbd2-1" name="monitor" interval="60s" timeout="60s" start_delay="30s" role="Master"/>
 <op id="op-ms-drbd2-2" name="monitor" interval="60s" timeout="60s" start_delay="30s" role="Slave"/>
</operations>

You cannot have multiple operations with the same interval on a resource. Try to change the interval. Regards Dominik
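The suggested fix, sketched against the quoted operations (only the Slave interval changes; any value distinct from the Master interval would do):

  <op id="op-ms-drbd2-1" name="monitor" interval="60s" timeout="60s" start_delay="30s" role="Master"/>
  <op id="op-ms-drbd2-2" name="monitor" interval="59s" timeout="60s" start_delay="30s" role="Slave"/>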
Re: [Linux-HA] external/ipmi example configuration
How can I test the stonith plugin eg. tell heartbeat to shoot someone? iptables -I INPUT -j DROP ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] DRBD - split brain - and HA is happily migrating
Thanks for your help. It looks like everything works as desired:

(postgres-02) [~] ifconfig eth1 down
(postgres-02) [~] cat /proc/drbd
version: 8.2.1 (api:86/proto:86-87)
GIT-hash: 318925802fc2638479ad090b73d7af45503dd184 build by [EMAIL PROTECTED], 2007-12-29 17:37:25
 0: cs:WFConnection st:Secondary/Unknown ds:Outdated/DUnknown C r---
    ns:60499852 nr:713732 dw:713732 dr:60499852 al:0 bm:3693 lo:0 pe:0 ua:0 ap:0
    resync: used:0/31 hits:3777806 misses:3724 starving:0 dirty:0 changed:3724
    act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0
(postgres-02) [~] ifconfig eth1 up
(postgres-02) [~] cat /proc/drbd
version: 8.2.1 (api:86/proto:86-87)
GIT-hash: 318925802fc2638479ad090b73d7af45503dd184 build by [EMAIL PROTECTED], 2007-12-29 17:37:25
 0: cs:WFConnection st:Secondary/Unknown ds:Outdated/DUnknown C r---
    ns:60499852 nr:713732 dw:713732 dr:60499852 al:0 bm:3693 lo:0 pe:0 ua:0 ap:0
    resync: used:0/31 hits:3777806 misses:3724 starving:0 dirty:0 changed:3724
    act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0
(postgres-02) [~] cat /proc/drbd
version: 8.2.1 (api:86/proto:86-87)
GIT-hash: 318925802fc2638479ad090b73d7af45503dd184 build by [EMAIL PROTECTED], 2007-12-29 17:37:25
 0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---
    ns:60499852 nr:715292 dw:715292 dr:60499852 al:0 bm:3705 lo:0 pe:0 ua:0 ap:0
    resync: used:0/31 hits:3777942 misses:3736 starving:0 dirty:0 changed:3736
    act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0

When I take the crosslink down, the disk on postgres-02 gets outdated; when I bring it back up, it syncs in no time.

You should be aware of one thing though: if you have a DRBD split brain now and your primary crashes while in split brain, heartbeat will never be able to start your resource on the secondary node, as the DRBD resource is outdated. Read: your resource will not run at all, not even with old data. You will have to manually do something like drbdadm -- --overwrite-data-of-peer primary $resource to get the device into primary state on an outdated, disconnected secondary. When the crashed primary comes back, you need to drbdadm -- --discard-my-data connect $resource (on the crashed primary) to get it in sync again - heartbeat is not able to do that on its own (which is good; it shouldn't know a way to force a device to primary state). Regards Dominik
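The recovery sequence from the last paragraph, condensed into commands (replace $resource with the resource name from drbd.conf):

  # on the outdated, disconnected secondary that must take over:
  drbdadm -- --overwrite-data-of-peer primary $resource
  # later, on the node that crashed while primary, to rejoin and resync:
  drbdadm -- --discard-my-data connect $resource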
Re: [Linux-HA] DRBD - split brain - and HA is happily migrating
Thomas Glanzmann wrote: Hello, I have drbd (newest version; same goes for heartbeat) running as a master/slave resource on the latest heartbeat, and I had the following problem: I had a split-brain situation

In DRBD or in the entire cluster?

and heartbeat made it possible to migrate from one node to another, and I wonder how that is possible? How do other people handle this situation? My setup so far is the following:

You didn't give your drbd.conf, but I suppose you do not use DRBD resource fencing. Without resource fencing, it is perfectly possible to execute drbdadm primary $resource on a disconnected secondary. Take a look at the resource fencing function in DRBD. The primary will then use another communication path to set the secondary resource to an outdated status. drbdadm primary on an outdated resource will not succeed unless you manually force it (the DRBD RA does not do that). Read the drbd.conf man page section about it; this link is also worth a read: http://blogs.linbit.com/florian/2007/10/01/an-underrated-cluster-admins-companion-dopd/ Regards Dominik
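A hedged drbd.conf sketch of the resource fencing setup the linked article describes (the handler keyword changed between DRBD versions - outdate-peer in 8.0/8.2, fence-peer from 8.3 on - and the resource name r0 is a placeholder; the path is the usual heartbeat location of the dopd helper):

  resource r0 {
    disk {
      fencing resource-only;
    }
    handlers {
      outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater";
    }
  }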
Re: [Linux-HA] DRBD - ext3 - IP - PostgreSQL Setup
Thomas Glanzmann wrote: Hello, I have the following setup: DRBD = ext3 = IPaddr2 = pgsql. I have the following configured: 00_README:# Ressourcen hinzufügen: 00_README: 00_README:cibadmin -o resources -C -x 01_drbd 00_README:cibadmin -o resources -C -x 02_filesystem 00_README:cibadmin -o constraints -C -x 03_constraint_run_on 00_README:cibadmin -o constraints -C -x 04_order_drbd_before_fs0 00_README:cibadmin -o constraints -C -x 05_colocation_drbd_master_on_fs0 00_README: 00_README:cibadmin -o resources -C -x 06_ip_address 00_README:cibadmin -o resources -C -x 07_pgsql 00_README: 00_README:cibadmin -o constraints -C -x 08_order_fs0_before_pgsql 00_README:cibadmin -o constraints -C -x 09_order_ip0_before_pgsql 00_README:cibadmin -o constraints -C -x 10_colocation_pgsql_ip0 00_README:cibadmin -o constraints -C -x 11_colocation_pgsql_fs0 00_README:cibadmin -o constraints -C -x 12_colocation_fs0_ip0 00_README: 00_README:# DRBD / FS starten: 00_README: 00_README:crm_resource -r ms-drbd0 -v '#default' --meta -p target_role 00_README:crm_resource -r fs0 -v '#default' --meta -p target_role 00_README:crm_resource -r pgsql0 -v '#default' --meta -p target_role 00_README: 00_README:00_README 00_README:01_drbd 00_README:02_filesystem 00_README:03_constraint_run_on 00_README:04_order_drbd_before_fs0 00_README:05_colocation_drbd_master_on_fs0 00_README:06_ip_address 00_README:07_pgsql 00_README:08_order_fs0_before_pgsql 00_README:09_order_ip0_before_pgsql 00_README:10_colocation_pgsql_ip0 00_README:11_colocation_pgsql_fs0 00_README:12_colocation_fs0_ip0 01_drbd: master_slave id=ms-drbd0 01_drbd: meta_attributes id=ma-ms-drbd0 01_drbd: attributes 01_drbd: nvpair id=ma-ms-drbd0-1 name=clone_max value=2/ 01_drbd: nvpair id=ma-ms-drbd0-2 name=clone_node_max value=1/ 01_drbd: nvpair id=ma-ms-drbd0-3 name=master_max value=1/ 01_drbd: nvpair id=ma-ms-drbd0-4 name=master_node_max value=1/ 01_drbd: nvpair id=ma-ms-drbd0-5 name=notify value=yes/ 01_drbd: nvpair id=ma-ms-drbd0-6 name=globally_unique value=false/ 01_drbd: nvpair id=ma-ms-drbd0-7 name=target_role value=stopped/ 01_drbd: /attributes 01_drbd: /meta_attributes 01_drbd: primitive id=drbd0 class=ocf provider=heartbeat type=drbd 01_drbd: instance_attributes id=ia-drbd0 01_drbd: attributes 01_drbd: nvpair id=ia-drbd0-1 name=drbd_resource value=postgres/ 01_drbd: /attributes 01_drbd: /instance_attributes 01_drbd: /primitive 01_drbd: /master_slave 02_filesystem:primitive class=ocf provider=heartbeat type=Filesystem id=fs0 02_filesystem:meta_attributes id=ma-fs0 02_filesystem:attributes 02_filesystem:nvpair name=target_role id=ma-fs0-1 value=stopped/ 02_filesystem:/attributes 02_filesystem:/meta_attributes 02_filesystem: 02_filesystem:instance_attributes id=ia-fs0 02_filesystem:attributes 02_filesystem:nvpair id=ia-fs0-1 name=fstype value=ext3/ 02_filesystem:nvpair id=ia-fs0-2 name=directory value=/srv/postgres/ 02_filesystem:nvpair id=ia-fs0-3 name=device value=/dev/drbd0/ 02_filesystem:/attributes 02_filesystem:/instance_attributes 02_filesystem:/primitive 03_constraint_run_on:rsc_location id=drbd0-placement-1 rsc=ms-drbd0 03_constraint_run_on:rule id=drbd0-rule-1 score=-INFINITY 03_constraint_run_on:expression id=exp-01 value=postgres-01 attribute=#uname operation=ne/ 03_constraint_run_on:expression id=exp-02 value=postgres-02 attribute=#uname operation=ne/ 03_constraint_run_on:/rule 03_constraint_run_on:/rsc_location 04_order_drbd_before_fs0:rsc_order id=drbd0_before_fs0 from=fs0 action=start to=ms-drbd0 to_action=promote/ 
05_colocation_drbd_master_on_fs0:rsc_colocation id=fs0_on_drbd0 to=ms-drbd0 to_role=master from=fs0 score=infinity/ 06_ip_address:primitive class=ocf provider=heartbeat type=IPaddr2 id=ip0 06_ip_address:meta_attributes id=ma-ip0 06_ip_address:attributes 06_ip_address:nvpair name=target_role id=ma-ip0-1 value=stopped/ 06_ip_address:/attributes 06_ip_address:/meta_attributes 06_ip_address: 06_ip_address:instance_attributes id=ia-ip0 06_ip_address:attributes 06_ip_address:nvpair id=ia-ip0-1 name=ip value=172.17.0.20/ 06_ip_address:nvpair id=ia-ip0-2 name=cidr_netmask value=24/ 06_ip_address:nvpair id=ia-ip0-3 name=nic value=eth0.2/ 06_ip_address:/attributes 06_ip_address:/instance_attributes 06_ip_address:/primitive 07_pgsql:primitive class=ocf provider=heartbeat type=pgsql id=pgsql0 07_pgsql:
Re: [Linux-HA] DRBD Config
We are using a two-node master/slave cluster with openSuSE 10.3, heartbeat 2.0.7 and drbd 8.0.6. I tried the configuration from this webpage: http://www.linux-ha.org/DRBD/HowTov2

This should only be done with a recent version of heartbeat/crm. There have been major improvements to multistate resources since 2.0.7. I suggest trying the 2.1.3 testing version from http://hg.linux-ha.org/test/archive/tip.tar.gz or at least the latest interim build from http://download.opensuse.org/repositories/server:/ha-clustering/openSUSE_10.3/ Especially the testing version works like a charm with DRBD8 here, although the interim build should be fine as well. Regards Dominik
Re: [Linux-HA] DRBD Config
2) DRBD8 is NOT supported from heartbeat. Please use DRBD0.7 I know the howto states so, but did you try it? Works for me ... Imho, the docs are outdated about that. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] DRBD Config
Dec 20 12:57:49 mylogin1 drbd[7119]: [7131]: DEBUG: : Calling /sbin/drbdadm -c /etc/drbd.conf state
Dec 20 12:57:49 mylogin1 drbd[7119]: [7134]: DEBUG: : Exit code 0

Can you copy/paste what you get when you issue /sbin/drbdadm -c /etc/drbd.conf state by hand?

That's a syntax error. The resource argument is missing (as stated in the first email). Regards Dominik
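For comparison, the same call with the missing resource argument supplied ("repdata" is just an example name; "all" also works):

  /sbin/drbdadm -c /etc/drbd.conf state repdata
  /sbin/drbdadm -c /etc/drbd.conf state all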
Re: [Linux-HA] DRBD Config
Please see my thread from Oct 18th on this list and esp the answer from Andrew from Oct 22nd. I read that thread. There also was someone else who stated it's working for him. Can you confirm that It works? For LARGE partitions? Would be good news! The largest partition I manage with DRBD v8 in heartbeat has 130 GB. Works as expected. If you look at the commands the RA issues - why should it not? Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Relevance of STONITH with Xen over DRBD setup
I just want to confirm this. From what i've learned so far, STONITH is relevant only to avoid data corruption when using shared storage. So, is STONITH relevant when i'm using a non-shared setup with Heartbeat and XEN VM on top of DRBD? Xen is using file images created on top of a ext3 FS on top of DRBD, and there should not be any concurrent access. Here you say yourself that you need STONITH :) You might as well be fine with DRBD resource fencing. As long as communication paths are redundant, you *should* not end up with a domU running on both dom0s. You still might, if DRBD resource fencing does not work properly. To be really sure, I'd rather have a stonith device. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Ordering Resource Groups V2
Damon Estep wrote: I have created an order constraint that requires a DRBD/iSCSI target resource group to be up before an application resource groups comes up. At startup the order is honored, and the resource groups come up in the desired order. In the event of a failover in the storage group I would like the application groups to go offline while the storage group fails over to another node, otherwise the applications will crash because they have lost access to the storage. The application resource groups do not stop while the storage group recovers. If two groups are created, call them resource1 and resource2, and then an order constraint is created where resource1 before resource2, should resource2 go offline during a failover of resource1? In my test setup they do not. Why don't you just use one group? That should give you the intended behaviour. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Re: A question about the combined score
If the error occurs on the resource, then resource-failure-stickiness comes into play and makes your scores:

Node1: 10 - 10 = 0
Node2: 9

As 9 > 0, the resource will be started on Node2, and 22 stickiness will be added. So you have 31 > 0.

In your comments, you remarked that a failover should occur if the following condition is met:

Node1_preference + Node1_failcount * failure_stickiness < Node2_preference + Node2_failcount * failure_stickiness

Is my understanding correct?

Sounds good to me.
Re: [Linux-HA] default_resource_stickiness=INFINITY and default_resource_failure_stickiness=-INFINITY
I have created simple 2-node cluster with 4 drbd multi-state resources and xen DomU on it with enabled stonith and setting: Why don't you use drbd natively in xen? In your drbd installation, you should find a script named block-drbd. Copy that to /etc/xen/scripts and config your domU like: disk = [ 'drbd:drbd1,hda1,w' ] That way, you only need the xen resource in heartbeat and don't have to worry about master/slave resources at all. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] default_resource_stickiness=INFINITY and default_resource_failure_stickiness=-INFINITY
And to answer to your question: I have created simple 2-node cluster with 4 drbd multi-state resources and xen DomU on it with enabled stonith and setting: default_resource_stickiness=INFINITY and default_resource_failure_stickiness=-INFINITY. The idea for cluster working is: 1. promote all drdd resources on node sles236 and start drbd resources on node sles238. (works OK) 2. when failure of drbd occurs - promote all drbd resources on sles238 and reboot sles236. 3. when sles236 join back to the cluster after reboot, leave drbd promoted on sles238. I thought that setting: default_resource_stickiness=INFINITY and default_resource_failure_stickiness=-INFINITY guarantee this bahaviour but in fact I have: - when sles236 join back to the cluster after reboot, all drbd resources are demoted on sles238 and promoted on sles36. Where is mistake in my cib.xml? ... constraints rsc_location id=pref_location_drbd0 rsc=ms-drbd0 rule id=sles236_location_drbd0 score=100 boolean_op=and role=Master expression attribute=#uname id=drbd0_on_sles236 operation=eq value=sles236/ /rule /rsc_location rsc_location id=pref_location_drbd1 rsc=ms-drbd1 rule id=sles236_location_drbd1 score=100 boolean_op=and role=Master expression attribute=#uname id=drbd1_on_sles236 operation=eq value=sles236/ /rule /rsc_location rsc_location id=pref_location_drbd2 rsc=ms-drbd2 rule id=sles236_location_drbd2 score=100 boolean_op=and role=Master expression attribute=#uname id=drbd2_on_sles236 operation=eq value=sles236/ /rule /rsc_location rsc_location id=pref_location_drbd3 rsc=ms-drbd3 rule id=sles236_location_drbd3 score=100 boolean_op=and role=Master expression attribute=#uname id=drbd3_on_sles236 operation=eq value=sles236/ /rule I think this is the cause. You prefer to run drbd on sles236. /rsc_location rsc_order id=drbd0_before_tr2_xen from=xen_tr2 action=start to=ms-drbd0 to_action=promote/ rsc_order id=drbd1_before_tr2_xen from=xen_tr2 action=start to=ms-drbd1 to_action=promote/ rsc_order id=drbd2_before_tr2_xen from=xen_tr2 action=start to=ms-drbd2 to_action=promote/ rsc_order id=drbd3_before_tr2_xen from=xen_tr2 action=start to=ms-drbd3 to_action=promote/ rsc_colocation id=col_xen_drbd0_master from=xen_tr2 from_role=Started to=ms-drbd0 to_role=Master score=INFINITY/ rsc_colocation id=col_xen_drbd1_master from=xen_tr2 from_role=Started to=ms-drbd1 to_role=Master score=INFINITY/ rsc_colocation id=col_xen_drbd2_master from=xen_tr2 from_role=Started to=ms-drbd2 to_role=Master score=INFINITY/ rsc_colocation id=col_xen_drbd3_master from=xen_tr2 from_role=Started to=ms-drbd3 to_role=Master score=INFINITY/ /constraints /configuration /cib Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
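A hedged sketch of the smallest change implied by the answer: once the cluster is up, drop (or score down) the Master-role location preferences so nothing pulls the master back to sles236. Deletion works by tag and id, one constraint at a time:

  cibadmin -D -X '<rsc_location id="pref_location_drbd0"/>'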
Re: [Linux-HA] Pingd
China wrote: Ok, now it works, but when PC_A comes back up, the resource doesn't remain on PC_B and fails back to PC_A. How can I configure it to switch to PC_B the first time PC_A fails over, but not switch back when PC_A returns?

Set resource stickiness to a reasonable value. Here's roughly how stickiness works: when you start up, the scores are calculated from your constraints and a decision is made where to run the resource (PC_A in your case). Afterwards, the stickiness is added to the score for PC_A because the resource is running there. Now PC_A fails. The node with the highest score for your resource gets to run it (PC_B). After a successful start, the stickiness is added to the score for PC_B. Now PC_A comes back. It will get its normal score from your constraints, but no stickiness, because the resource is not running there. So if you make (PC_B score + stickiness) > (PC_A score), the resource will stay on PC_B.

Regards Dominik
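A worked example with hypothetical numbers - location score 0 on both nodes, stickiness 100:

  startup:       PC_A = 0 + 100 (running here)   PC_B = 0          -> runs on PC_A
  PC_A fails:    PC_A = (down)                   PC_B = 0 + 100    -> runs on PC_B
  PC_A returns:  PC_A = 0                        PC_B = 0 + 100    -> stays on PC_B

Since 0 + 100 > 0, no failback happens.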
[Linux-HA] Possible bug in Score calculation?
Hi, sorry I have to bother again about score calculation, but I came across something I don't understand and that might be a bug. I have a master-slave drbd resource called ms-drbd2 (the primitive is called drbd2) and a group named testdb (4 primitives, mount being the first). The pingd multiplier is 300 (5 ping nodes) and default-resource-stickiness is 200. And these are my constraints:

  <constraints>
    <rsc_order id="drbd2_before_testdb" from="testdb" action="start" to="ms-drbd2" to_action="promote"/>
    <rsc_colocation id="testdb_on_drbd2" to="ms-drbd2" to_role="master" from="testdb" score="300"/>
  </constraints>

When I look at the scores, I see:

  drbd2:0  node1  0          Okay.
  drbd2:0  node2  676        Weird. This looks like the crm_master value from the
                             drbd OCF RA plus stickiness times (number_of_group_items - 1).
  drbd2:1  node1  76         The crm_master value from the drbd OCF RA.
  drbd2:1  node2  -infinity  Okay.
  mount    node1  0          Okay.
  mount    node2  1100       300 (constraint) + 4 x 300 (pingd)

How did I come to think this has to do with the group? Because if I add another item to the group, the score increases by 1 x stickiness. Guess this shouldn't be, should it?

Regards Dominik

Here's the cib.xml configuration:

  <crm_config>
    <cluster_property_set id="cib-bootstrap-options">
      <attributes>
        <nvpair name="no-quorum-policy" value="stop" id="no-quorum-policy"/>
        <nvpair name="symmetric-cluster" value="true" id="symmetric-cluster"/>
        <nvpair name="stonith-enabled" value="false" id="stonith-enabled"/>
        <nvpair name="stonith-action" value="reboot" id="stonith-action"/>
        <nvpair name="default-resource-stickiness" value="200" id="default-resource-stickiness"/>
        <nvpair name="default-resource-failure-stickiness" value="-100" id="default-resource-failure-stickiness"/>
        <nvpair name="is-managed-default" value="true" id="is-managed-default"/>
        <nvpair name="default-action-timeout" value="20s" id="default-action-timeout"/>
        <nvpair name="stop-orphan-resources" value="true" id="stop-orphan-resources"/>
        <nvpair name="stop-orphan-actions" value="true" id="stop-orphan-actions"/>
        <nvpair name="remove-after-stop" value="false" id="remove-after-stop"/>
        <nvpair name="pe-error-series-max" value="-1" id="pe-error-series-max"/>
        <nvpair name="pe-warn-series-max" value="-1" id="pe-warn-series-max"/>
        <nvpair name="pe-input-series-max" value="-1" id="pe-input-series-max"/>
        <nvpair name="startup-fencing" value="true" id="startup-fencing"/>
      </attributes>
    </cluster_property_set>
  </crm_config>
  <resources>
    <master_slave id="ms-drbd2">
      <meta_attributes id="ma-ms-drbd2">
        <attributes>
          <nvpair id="ma-ms-drbd2-1" name="clone_max" value="2"/>
          <nvpair id="ma-ms-drbd2-2" name="clone_node_max" value="1"/>
          <nvpair id="ma-ms-drbd2-3" name="master_max" value="1"/>
          <nvpair id="ma-ms-drbd2-4" name="master_node_max" value="1"/>
          <nvpair id="ma-ms-drbd2-5" name="notify" value="yes"/>
          <nvpair id="ma-ms-drbd2-6" name="globally_unique" value="false"/>
          <nvpair id="ma-ms-drbd2-7" name="target_role" value="started"/>
        </attributes>
      </meta_attributes>
      <primitive id="drbd2" class="ocf" provider="heartbeat" type="drbd">
        <instance_attributes id="ia-drbd2">
          <attributes>
            <nvpair id="ia-drbd2-1" name="drbd_resource" value="drbd2"/>
          </attributes>
        </instance_attributes>
        <operations>
          <op id="op-ms-drbd2-1" name="monitor" interval="5s" timeout="5s" start_delay="30s" role="Master"/>
          <op id="op-ms-drbd2-2" name="monitor" interval="6s" timeout="5s" start_delay="30s" role="Slave"/>
        </operations>
      </primitive>
    </master_slave>
    <group id="testdb">
      <meta_attributes id="ma-testdb">
        <attributes>
          <nvpair id="ma-testdb-1" name="target_role" value="started"/>
        </attributes>
      </meta_attributes>
      <primitive id="mount" class="ocf" type="filesystem" provider="heartbeat"> ... </primitive>
      <primitive id="postgres" class="ocf" type="pgsql" provider="heartbeat"> ... </primitive>
      <primitive id="masterip" class="ocf" type="IPaddr2" provider="heartbeat"> ... </primitive>
      <primitive id="slon" class="ocf" type="dkcustom" provider="dk"> ... </primitive>
    </group>
  </resources>
  <constraints>
    <rsc_order id="drbd2_before_testdb" from="testdb" action="start" to="ms-drbd2" to_action="promote"/>
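As a check, the flagged value decomposes exactly as suspected: 676 = 76 (the crm_master value from the drbd RA) + 3 x 200 (default-resource-stickiness once for each of the other group members, number_of_group_items - 1 = 3).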
Re: [Linux-HA] Pingd
With this configuration the resources don't fail over to test, but remain on test-ppc. Why?

I can't say. The configuration looks good to me. But again: what are you doing to force the failure? Do you really have just one connection between the nodes, and unplug that connection to force the failure? That is bound to cause problems. And you don't even have a STONITH configuration, which might contain the damage a little. Still, only one connection between the nodes is a problem.

Regards Dominik
Re: [Linux-HA] Pingd
China wrote: Sorry, I forgot it! I have two connections between the PCs: one with a crossover cable, where heartbeat sends its packets directly to the other PC, and one through the network, where the services listen and where pingd tests connectivity. When I force the failure, I disconnect the network cable that provides the services on PC_A.

Use both connections for heartbeat. Maybe use ucast on the external interface instead of bcast. Your supplied config file only lists one interface:

  ha.cf:
    use_logd yes
    compression zlib
    coredumps no
    keepalive 1
    warntime 2
    deadtime 5
    deadping 5
    udpport 694
    bcast eth2
    node test-ppc test3
    ping 192.168.122.113
    #respawn hacluster /rnd/apps/components/heartbeat/lib/heartbeat/ipfail
    #respawn root /rnd/apps/components/heartbeat/lib/heartbeat/pingd -m 100 -d 5s
    #auto_failback off
    crm yes

But if it really is the way you describe, this should not cause the problem you reported. Just a thought: if - by any chance - you pull the plug on the connection that sends/receives the heartbeats, you will have a split-brain situation, which would nicely explain the things you mentioned. So please use both links for heartbeat cluster communication and make sure you pull the right plug :)

Regards Dominik
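A minimal sketch of the suggested ha.cf change, assuming eth2 stays the crossover link and eth0 is the service-facing interface (the peer address 192.168.122.51 is hypothetical and must point at the other node, so it differs per node):

  # two redundant heartbeat links
  bcast eth2                 # crossover cable, as before
  ucast eth0 192.168.122.51  # the *other* node's service-network address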
Re: [Linux-HA] Pingd
But is it good to use an interface both for heartbeat and for services?

It's pretty common, I think.
Re: [Linux-HA] Pingd
China wrote: Last question: how can I see a node's score while the cluster is running?

You can grep it out of the ptest output. Or use my script:
http://lists.community.tummy.com/pipermail/linux-ha/2007-September/027488.html
which has been updated by Robert Lindgren:
http://lists.community.tummy.com/pipermail/linux-ha/2007-September/027745.html

Regards Dominik
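A minimal sketch of the grep approach, assuming a heartbeat 2.x ptest whose verbose debug output prints allocation scores in native_color lines (the exact options and output format differ between versions, so treat this as a starting point):

  # read the live CIB and filter the score lines
  ptest -L -VVVVVV 2>&1 | grep -i native_color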
Re: [Linux-HA] Possible bug in Score calculation?
Good morning Andrew,

sorry I have to bother again about score calculation, but I came across something I don't understand and that might be a bug. I have a master-slave drbd resource called ms-drbd2 (the primitive is called drbd2) and a group named testdb (4 primitives, mount being the first). The pingd multiplier is 300 (5 ping nodes) and default-resource-stickiness is 200. And these are my constraints:

  <constraints>
    <rsc_order id="drbd2_before_testdb" from="testdb" action="start" to="ms-drbd2" to_action="promote"/>
    <rsc_colocation id="testdb_on_drbd2" to="ms-drbd2" to_role="master" from="testdb" score="300"/>
  </constraints>

When I look at the scores, I see:

  drbd2:0  node1  0    Okay.
  drbd2:0  node2  676  Weird. This looks like the crm_master value from the
                       drbd OCF RA plus stickiness times (number_of_group_items - 1).

That would make sense.

How did I come to think this has to do with the group? Because if I add another item to the group, the score increases by 1 x stickiness. Guess this shouldn't be, should it?

It should. The group needs to go with the master... so it makes sense that you should take the group's location preferences into account when deciding where to place and promote drbd.

Full ack on that, but I didn't configure it (at least not that I knew of - yet). So it's known and wanted to work this way. Sweet! Just curious: I suppose it's my first constraint that does this job?

Regards Dominik
Re: [Linux-HA] Possible bug in Score calculation?
Just curious: I suppose it's my first constraint that does this job?

Second - the colocation one.

Okay, thanks - so sure enough, I even got the 50/50 wrong :p

Then I must ask another question: why does this not apply to colocated primitives? I just tested with a single primitive (testdb) colocated to the master role of ms-drbd2:

  drbd2:0  node1  0
  drbd2:0  node2  76
  drbd2:1  node1  76
  drbd2:1  node2  -infinity
  testdb   node1  0
  testdb   node2  500

  <resources>
    <master_slave id="ms-drbd2">
      <meta_attributes id="ma-ms-drbd2">
        <attributes>
          <nvpair id="ma-ms-drbd2-1" name="clone_max" value="2"/>
          <nvpair id="ma-ms-drbd2-2" name="clone_node_max" value="1"/>
          <nvpair id="ma-ms-drbd2-3" name="master_max" value="1"/>
          <nvpair id="ma-ms-drbd2-4" name="master_node_max" value="1"/>
          <nvpair id="ma-ms-drbd2-5" name="notify" value="yes"/>
          <nvpair id="ma-ms-drbd2-6" name="globally_unique" value="false"/>
          <nvpair id="ma-ms-drbd2-7" name="target_role" value="started"/>
        </attributes>
      </meta_attributes>
      <primitive id="drbd2" class="ocf" provider="heartbeat" type="drbd">
        <instance_attributes id="ia-drbd2">
          <attributes>
            <nvpair id="ia-drbd2-1" name="drbd_resource" value="drbd2"/>
          </attributes>
        </instance_attributes>
        <operations>
          <op id="op-ms-drbd2-1" name="monitor" interval="5s" timeout="5s" start_delay="30s" role="Master"/>
          <op id="op-ms-drbd2-2" name="monitor" interval="6s" timeout="5s" start_delay="30s" role="Slave"/>
        </operations>
      </primitive>
    </master_slave>
    <primitive id="testdb" class="ocf" type="filesystem" provider="heartbeat">
      <instance_attributes id="ia-mount">
        <attributes>
          <nvpair id="ia-mount-1" name="device" value="/dev/drbd2"/>
          <nvpair id="ia-mount-2" name="directory" value="/packages/postgres/data/"/>
          <nvpair id="ia-mount-3" name="fstype" value="ext3"/>
        </attributes>
      </instance_attributes>
      <operations>
        <op id="op-mount-1" name="monitor" interval="5s" timeout="5s" start_delay="30s" role="Started"/>
      </operations>
    </primitive>
  </resources>
  <constraints>
    <rsc_order id="drbd2_before_testdb" from="testdb" action="start" to="ms-drbd2" to_action="promote"/>
    <rsc_colocation id="testdb_on_drbd2" to="ms-drbd2" to_role="master" from="testdb" score="300"/>
  </constraints>
  </configuration>
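For the record, the arithmetic of the observed values: testdb on node2 = 300 (the testdb_on_drbd2 colocation score) + 200 (stickiness, since it runs there) = 500, while the drbd2 master scores only show the plain crm_master value of 76 - the open question being why the single primitive's preference is not folded into the master score the way the group's was.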
Re: [Linux-HA] 1000 extra score for a group?
I defined a rsc_location constraint with a rule with score 100 for a particular node for my HA group. Resource stickiness is 200. Furthermore, I use a rule with score_attribute=pingd (multiplier=100) for the group. With 5 available ping nodes this should make 100 + 200 + 500 = 800. But the score I see is 1800, so somehow 1000 were added here.

How many resources in the group? 6, by any chance? That would make the stickiness 6 x 200 = 1200 (instead of 200) and explain the extra 1000 (100 + 500 + 1200 = 1800).

You hit the nail on the head. Thanks for explaining this.
Re: [Linux-HA] RFC: Change on OCF RA Filesystem's monitor action
Grepping by device doesn't work for mount-by-label, at least, and requires a lot of escaping for networked mounts; so we thought grepping for the mountpoint was exactly the approach we needed to take.

Never used mount-by-label. Good point though.

I've got to admit I've never had someone use a symlink as a mountpoint. ;-)

/usr/local/postgres/data is my mountpoint. /usr/local/postgres points to /usr/local/postgres-version. I think this is not too uncommon. On the other hand, I could just symlink /usr/local/postgres/data to /usr/local/data and let that be my mountpoint. Or change my postgres config to use another data dir ...

Maybe the right answer would be for the RA to dereference the symlink then?

Well, one would have to check each directory in the path separately, I think. At least I don't know a better way to find out whether there is a symlink in the path or not. I think this would be a lot of work for not much gain, and probably a slower monitor action. Imho: let's forget about my suggestion.

Regards Dominik
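If one did want to follow the dereferencing idea, a minimal sketch (assuming GNU readlink is available; readlink -f canonicalises every symlink component of the path in one call, which covers the /usr/local/postgres -> /usr/local/postgres-version case without checking each directory separately):

  # resolve any symlinks before consulting mtab
  real_mp=$(readlink -f "$MOUNTPOINT")
  grep -q " $real_mp " /etc/mtab && echo "mounted"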
Re: [Linux-HA] stdout and stderr redirection in a resource agent
nohup $binfile $cmdline_options >$logfile 2>$errorlogfile

That binary is invoked from the RA, right?

Sure. So neither stdout nor stderr output should be seen, so Linux-HA should not log it. I'm just curious why it does.

Well, despite your redirections, something shows up on stdout (or stderr). If it does, then it's logged. I really can't say why it shows up, though.

I hate to admit it, but it was an error in my RA. What else ... The RA supports configuring $logfile and $errlogfile. If both are set, then the syntax above is used and actually works. If $errlogfile is not set, then both stdout and stderr go to $logfile. But apparently there was a syntax error, and that was the reason for stderr showing up in the logs. Thanks though.

Regards Dominik
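A minimal sketch of the start logic as described ($logfile, $errlogfile, $binfile and $cmdline_options are the variables mentioned in the post; the fallback branch reflects the described behaviour, not the actual RA code):

  if [ -n "$errlogfile" ]; then
      # split the streams into separate files
      nohup "$binfile" $cmdline_options >"$logfile" 2>"$errlogfile" &
  else
      # merge stderr into the logfile
      nohup "$binfile" $cmdline_options >"$logfile" 2>&1 &
  fi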
[Linux-HA] RFC: Change on OCF RA Filesystem's monitor action
Hi, I would like to suggest a change to the Filesystem RA. The monitor action currently does something like grep $MOUNTPOINT /etc/mtab. This does not work if you use a symbolic link as a mountpoint. If it instead grepped for $DEVICE (maybe with -w to avoid problems with 10+ partitions on one disc), it would still find out whether the filesystem is mounted, and it would solve that problem. What do you think?

Regards Dominik
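To make the proposal concrete (both grep lines are paraphrased sketches of the idea, not copies of the actual RA code):

  # current approach: breaks when $MOUNTPOINT is reached via a symlink
  grep "$MOUNTPOINT" /etc/mtab

  # proposed: match the device instead, word-bounded so that
  # /dev/sda1 does not also match /dev/sda10
  grep -w "$DEVICE" /etc/mtab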
[Linux-HA] stdout and stderr redirection in a resource agent
Hi ... again ... I wrote my own little RA to start a custom binary. Very basic RA up to now. I start my binary with

  nohup $binfile $cmdline_options >$logfile 2>$errorlogfile

It actually works ok - the logfiles are filled as expected - but I also see some of the output in the Linux-HA log (/var/log/messages):

  lrmd: [2618]: info: RA output: ---

How do I avoid this?

Regards Dominik
Re: [Linux-HA] stdout and stderr redirection in a resource agent
Dejan Muhamedagic wrote: Hi, On Wed, Nov 07, 2007 at 04:17:45PM +0100, Dominik Klein wrote:

Hi ... again ... I wrote my own little RA to start a custom binary. Very basic RA up to now. I start my binary with

  nohup $binfile $cmdline_options >$logfile 2>$errorlogfile

It actually works ok - the logfiles are filled as expected - but I also see some of the output in the Linux-HA log (/var/log/messages):

  lrmd: [2618]: info: RA output: ---

How do I avoid this?

Currently there is no way around that. We assume that if the RA says something, perhaps it's important, so it is logged. Why would you want to avoid it?

I start my binary with

  nohup $binfile $cmdline_options >$logfile 2>$errorlogfile

So neither stdout nor stderr output should be seen, so Linux-HA should not log it. I'm just curious why it does.

Regards Dominik
[Linux-HA] Feedback: Master/Slave RA for Postgres / Slony Cluster?
Hi,

a week earlier I asked whether there was a resource agent that implements Master/Slave for a Postgres cluster using slony-1 replication. There was not, so I tried to implement it myself. I want to report back to give an explanation and reference on why I think it is not possible (at the moment) to implement this in heartbeat. Here we go.

Short summary of slony-1 replication. In a slony-1 replication setup:
* Tables are put together into replication sets
* Each set has an origin (master)
* Only the origin can be written to
* There can be multiple sets, each with a different origin
* There can be multiple subscribers (slaves) for each set
* Subscribers are read-only

As you have to somehow tie the master role to the health of postgres itself, this restricts you to using only one set, or managing all sets at once. Well, okay, I think I could live with this.

Slony-1 implements two commands for switchover and failover. I mean "switchover" when I want to do a planned switch of roles while all machines are healthy. I mean "failover" when the master has a problem and the slave takes over.

So now comes the tricky part. In slony-1 you cannot make an origin a subscriber without making another subscriber the new origin. This happens in ONE command. So there are no independent demote and promote commands. In a two-machine setup you cannot have two slaves at a time. In other words: promote implicitly demotes the other machine, and demote implicitly promotes the other machine.

So I thought I could implement demote as "return 0", as promote on the other machine will do the job anyway. Well, not the best idea, as a monitor action on the apparently demoted machine will still return master status until promote on the second machine has finished.

Furthermore, the switchover command will fail if the other machine is not responding. In case the current master really has a problem, all you can do to get a writable database on the current slave is to use the failover command. But Linux-HA only knows promote and demote. So I implemented promote and demote the following way:

  promote() {
      if switchover_to_me; then
          return 0
      else
          failover_to_me
          return $?
      fi
  }

  demote() {
      switchover_to_other_machine
      # don't care if this works, as it cannot work if
      # the other machine is not healthy
      return 0
  }

What you also need to know about slony-1 is the fact that you need to resync the COMPLETE data after a failover. In slony-1 it is not possible to let a failed node rejoin the slony cluster (even if it was healthy when the failover command was issued). It has to fetch ALL data from the new master. So you want to avoid failover if it is not absolutely necessary.

Up to now I thought my RA could handle a few cases, and it turns out it can handle SOME (like a master reboot, a slave reboot or a controlled switchover). But as simple a thing as killing postgres on the master machine causes a failover. Why? Say A is master and B is slave at this moment:
1. monitor on A fails
2. Linux-HA executes demote on A - as you see above, this will "work" even though it does nothing
3. Linux-HA executes promote on B - this, as postgres on A is not running, will end up in a failover (see above)

This is pretty much it. If you have any ideas on how to improve this, or if you also think that this is impossible with the current master/slave implementation in Linux-HA - please respond. The whole separate demote and promote approach in Linux-HA seems to just not fit the way slony-1 handles switchover and failover.

If you have any more questions (it may well be that I forgot something), just ask - I'll be happy to help improve Linux-HA.

Best regards Dominik
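For concreteness, here is what the two underlying slony-1 operations look like at the slonik level (a sketch only; the cluster name, node ids, set id and conninfo strings are hypothetical, and the exact grammar may differ between slony-1 versions):

  #!/bin/sh
  # switchover: both nodes must be reachable; the origin moves cleanly
  slonik <<EOF
  cluster name = mycluster;
  node 1 admin conninfo = 'dbname=mydb host=nodeA';
  node 2 admin conninfo = 'dbname=mydb host=nodeB';
  lock set (id = 1, origin = 1);
  move set (id = 1, old origin = 1, new origin = 2);
  EOF

  # failover: used when node 1 is dead; node 1 must afterwards be
  # rebuilt from scratch, which is why the RA tries hard to avoid this
  slonik <<EOF
  cluster name = mycluster;
  node 1 admin conninfo = 'dbname=mydb host=nodeA';
  node 2 admin conninfo = 'dbname=mydb host=nodeB';
  failover (id = 1, backup node = 2);
  EOF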
Re: [Linux-HA] Feedback: Master/Slave RA for Postgres / Slony Cluster?
Hi Andrew, thanks for your reply.

So I thought I could implement demote as "return 0", as promote on the other machine will do the job anyway. Well, not the best idea, as a monitor action on the apparently demoted machine will still return master status until promote on the second machine has finished.

What if the crm delayed the slave's monitor until after the other side was promoted... would that help significantly?

That would probably prevent one failed monitor action in this very special case.

Furthermore, the switchover command will fail if the other machine is not responding. In case the current master really has a problem, all you can do to get a writable database on the current slave is to use the failover command. But Linux-HA only knows promote and demote. So I implemented promote and demote the following way:

  promote() {
      if switchover_to_me; then
          return 0
      else
          failover_to_me
          return $?
      fi
  }

  demote() {
      switchover_to_other_machine
      # don't care if this works, as it cannot work if
      # the other machine is not healthy
      return 0
  }

What you also need to know about slony-1 is the fact that you need to resync the COMPLETE data after a failover. In slony-1 it is not possible to let a failed node rejoin the slony cluster (even if it was healthy when the failover command was issued). It has to fetch ALL data from the new master. So you want to avoid failover if it is not absolutely necessary. Up to now I thought my RA could handle a few cases, and it turns out it can handle SOME (like a master reboot, a slave reboot or a controlled switchover). But as simple a thing as killing postgres on the master machine causes a failover. Why? Say A is master and B is slave at this moment:
1. monitor on A fails
2. Linux-HA executes demote on A - as you see above, this will "work" even though it does nothing
3. Linux-HA executes promote on B - this, as postgres on A is not running, will end up in a failover (see above)

Notifications might help. The Filesystem agent (when operating in OCFS2 mode) keeps a list of who its peers are. If you did the same, then I think you'd be able to recognize that you're all alone and that it was ok to switchover_to_me instead.

Read my first post again. Switchover is not possible if the other postgres instance is not available. The only way to make a single slave the new master is to use the failover command. What *would* help here is:

1. monitor on A fails -> OCF_NOT_RUNNING
2. Now, instead of "demote A, promote B": stop/start the resource on A

IIRC, start includes a monitor action (sometimes called a probe in this case). This would report OCF_RUNNING_MASTER, so the problem would be solved. On the other hand, this is probably a pretty big change in Linux-HA's master/slave handling, and it should be discussed.

Regards Dominik