Re: [Linux-HA] type: pseduo
On Mon, Jul 4, 2011 at 7:06 PM, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: Hi, found this syslog message on a SLES11 SP1 system: Jul 4 10:55:14 rksaph02 crmd: [11517]: WARN: print_elem: [Action 83]: Pending (id: grp_t11_as2_stopped_0, type: pseduo, priority: 2570) I guess the type should be pseudo... yes ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Q: ms-resources and grouping
On Thu, Jun 30, 2011 at 7:41 PM, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: Hi! I have a question: when I want to have a filesystem on a logical volume, where the VG is on a RAID1, I would typically have three resources to handle that. Now if I wish to have a clone or ms resource, how could I connect the resources so that the resource's nodes find the desired filesystems? order and colocation? Correct :) ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
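For readers wanting the concrete shape of that answer, here is a minimal sketch in crm shell syntax, assuming a hypothetical group of the three storage resources and a hypothetical master/slave resource called ms_app (all names and device paths are made up, not taken from the thread):

    primitive p_raid1 ocf:heartbeat:Raid1 \
        params raidconf="/etc/mdadm.conf" raiddev="/dev/md0" \
        op monitor interval=60s
    primitive p_lvm ocf:heartbeat:LVM \
        params volgrpname="vg01" \
        op monitor interval=60s
    primitive p_fs ocf:heartbeat:Filesystem \
        params device="/dev/vg01/lv01" directory="/srv/data" fstype="ext3" \
        op monitor interval=60s
    group g_storage p_raid1 p_lvm p_fs
    # ms_app stands for whatever master/slave-capable primitive you are running
    colocation col_app_with_storage inf: ms_app:Master g_storage
    order ord_storage_before_app inf: g_storage:start ms_app:promote

The colocation keeps the master role on the node that holds the storage group, and the order makes sure promotion only happens once the filesystem stack is up.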
Re: [Linux-HA] Help for the floating IP address monitoring
On Thu, Jun 30, 2011 at 3:47 PM, 徐斌 robin@163.com wrote: Hi Gents, I want the floating IP to come back up after the network is restarted, but I ran into an issue when I enabled monitoring for the floating IP (using ocf:heartbeat:IPaddr2). [root@master ~]# crm configure show ip2 primitive ip2 ocf:heartbeat:IPaddr2 \ params ip=172.20.33.88 nic=eth1 iflabel=0 cidr_netmask=255.255.255.0 \ op monitor interval=10s The IP address configured on the eth0 was lost, and worse, I cannot bring the nic up before I stop heartbeat. eth1 Link encap:Ethernet HWaddr 08:00:27:11:87:63 inet6 addr: fe80::a00:27ff:fe11:8763/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:127723 errors:0 dropped:0 overruns:0 frame:0 TX packets:7288 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:11752527 (11.2 MiB) TX bytes:1331914 (1.2 MiB) eth1:0 Link encap:Ethernet HWaddr 08:00:27:11:87:63 inet addr:172.20.33.88 Bcast:172.20.33.255 Mask:255.255.255.0 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 [root@master ~]# ifup eth1 [root@master ~]# ifconfig eth1 eth1 Link encap:Ethernet HWaddr 08:00:27:11:87:63 inet6 addr: fe80::a00:27ff:fe11:8763/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:127956 errors:0 dropped:0 overruns:0 frame:0 TX packets:7362 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:11773863 (11.2 MiB) TX bytes:1341574 (1.2 MiB) I think there may be a race between '/etc/init.d/network' and 'pacemaker': if 'pacemaker' starts eth1:0 first, then the network script does not set the IP address for eth1. Right, the resource agent assumes the device is always up and only adds/removes aliases. Does anyone else have this issue? And is there any other way to restart the floating IP without enabling the monitoring operation? Regards, -robin ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Question on order rule
On Thu, Jun 9, 2011 at 4:58 AM, Alessandro Iurlano alessandro.iurl...@gmail.com wrote: Hello. I'm trying to set up a highly available OpenVZ cluster. As OpenVZ only supports disk quota on ext3/4 local filesystems (nfs and gfs/ocfs2 don't work), I have set up two iscsi volumes on a highly available storage where VMs will be stored. I would like to use three servers, two active and one spare, for this cluster. That's because OpenVZ expects all the VMs to be under the /vz or /var/lib/vz directory. This leads to the constraint that each server can have only one iscsi volume attached and mounted on /vz. Also, as the ext3 fs is not a cluster filesystem, each iscsi volume has to be mounted on a single server at a time. So I need to map two iscsi volumes to three servers. I have created a pacemaker configuration with two groups, group-iscsi1 and group-iscsi2, that take care of connecting the iscsi devices and mounting the filesystems on /vz. A negative colocation directive forbids the two groups from being active on the same cluster node at the same time. So far things are working. The problem is with the resource that controls OpenVZ. It is a lsb:vz primitive that needs to be a clone (I can't make it into separate lsb-vz1 or lsb-vz2 primitives because the cluster sees both active on every node, as they refer to the same /etc/init.d/vz script). After creating the clone clone-vz, I defined the location constraints as location vz-on-iscsi inf: group-iscsi1 clone-vz and did the same for group-iscsi2. The clone resource needs to be started after the filesystems are mounted, so I need two order constraints like order vz-after-fs1 inf: group-iscsi1:start clone-vz:start and order vz-after-fs2 inf: group-iscsi2:start clone-vz:start This works perfectly when both iscsi volumes are up. But if one of them is stopped, clone-vz does not start. I guess this is because the two order constraints create dependencies for clone-vz on both the iscsi groups, so that the clone is started only when group-iscsi1 AND group-iscsi2 have started. Is there a way I can tell pacemaker to start clone-vz on a node if that node has resource group-iscsi1 OR group-iscsi2? Not yet I'm afraid. Resource sets will eventually allow this but I don't think it's there yet. The complete pacemaker configuration is here: http://nopaste.info/8c2ba79159.html Thanks, Alessandro ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
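For reference: later Pacemaker releases (roughly 1.1.7 onward) added a require-all="false" option to resource sets in ordering constraints, which expresses exactly this OR relationship. A hedged sketch of what the raw constraint XML might look like, reusing the names from this thread (treat the attribute names and the version threshold as assumptions and check the documentation for your release):

    <rsc_order id="vz-after-storage">
      <resource_set id="vz-after-storage-0" sequential="false" require-all="false">
        <resource_ref id="group-iscsi1"/>
        <resource_ref id="group-iscsi2"/>
      </resource_set>
      <resource_set id="vz-after-storage-1">
        <resource_ref id="clone-vz"/>
      </resource_set>
    </rsc_order>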
Re: [Linux-HA] Web resource monitoring
On Mon, Jun 27, 2011 at 3:19 AM, Maxim Ianoglo dot...@gmail.com wrote: Hello, The http monitoring code should be split off from the apache RA. Then a simple stateless (see the Dummy RA for a sample) RA, say httpmon, can be created which would source the http monitoring. Patches accepted! Guidance and constructive critique offered :) Ok, thank you for the suggestion of a name for the RA :) Here is what I wrote: https://github.com/dotNox/heartbeat_resources/blob/master/httpmon A small description is available at: http://dotnox.net/2011/06/multiple-ha-resources-based-on-same-service-heartbeat-httpmon-ra/ I did not want to patch or change anything in the apache RA: for someone who does not have overlapping resources like I have, it will be easier to just use the apache RA. So if apache fails, how does this agent organise recovery? ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
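For context, the way a monitor-only agent like this usually organises recovery is indirect: colocate the things that should move (typically a floating IP) with the httpmon resource, so a failed monitor drives them to a healthier node. A rough sketch in crm shell syntax; the RA's real parameter names are not taken from the posted agent, so treat them as placeholders:

    primitive p_httpmon ocf:heartbeat:httpmon \
        params url="http://127.0.0.1/" \
        op monitor interval=10s timeout=20s
    primitive p_vip ocf:heartbeat:IPaddr2 \
        params ip="192.168.1.100" cidr_netmask="24" \
        op monitor interval=30s
    colocation col_vip_with_httpmon inf: p_vip p_httpmon
    order ord_httpmon_before_vip inf: p_httpmon p_vip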
Re: [Linux-HA] serial cable or ethernet cable for heartbeat, which one is better?
On Mon, Jun 27, 2011 at 3:03 AM, Hai Tao taoh...@hotmail.com wrote: Which one is better for heartbeat, a serial cable or a dedicated ethernet cable? Can the bandwidth of a serial cable be a bottleneck? If you're running pacemaker - yes. How much data is transferred on the heartbeat link? Thanks. Hai Tao ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] CIB process quits and could not connect to CRM
On Mon, May 16, 2011 at 11:36 PM, Mateusz Kalisiak mateusz.kalis...@gmail.com wrote: Hello, I'm struggling with the same problem on RHEL 6. Does anyone have any idea how to solve this? Any help would be appreciated. You'd need to provide more details than that. Have you tried reading the logs? Best Regards, Mateusz ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Best way for colocating resource on a dual primary drbd
On Mon, May 16, 2011 at 5:38 PM, RaSca ra...@miamammausalinux.org wrote: Il giorno Lun 16 Mag 2011 09:01:08 CET, Andrew Beekhof ha scritto: [...] Implicit that once the resource go away it becomes slave? Pretty sure this is a bug in 1.0. Have you tried 1.1.5 ? Not yet, but so Andrew are you saying that keeping the colocation even if I have a dual primary drbd is the best thing to do? Yes. -- RaSca Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene! ra...@miamammausalinux.org http://www.miamammausalinux.org ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Colocation of VIP and httpd
On Wed, Jun 1, 2011 at 12:04 PM, 吴鸿宇 whyfo...@gmail.com wrote: Thank you for your reply. My requirement is like this: The httpd service runs on every node in the cluster and is monitored by watchdog. VIP only runs on one node at a time. Heartbeat will check the status of httpd on each node and make sure the VIP runs on the node that has httpd running. Note that the Heartbeat is not expected to control the httpd service but just to monitor. Say I have Heartbeat, then I have the following configuration questions: 1) Should I use clone for monitoring httpd? No, you should clone the httpd service. Each instance of the clone is responsible for monitoring itself. 2) Which operation should I specified for the action of httpd service? fence or block or another? It depends what you want. Try reading the documentation for those options. Is the combination clone+action+colocation enough for the requirement above? If not, what else special configuration do I need? Thank you for any advices! Hongyu On Tue, May 24, 2011 at 12:48 AM, RaSca ra...@miamammausalinux.org wrote: Il giorno Gio 19 Mag 2011 19:25:54 CET, 吴鸿宇 ha scritto: Hi All, I have a 2 node cluster. My intention is ensuring the VIP is always on the node that has httpd running, i.e. if service httpd on the VIP node is stopped and fails to start, the VIP should switch to the other node. With the configuration below, I observed that when httpd stops and fails to start, the VIP is stopped also but is not switched to the other node that has healthy httpd. I appreciate any ideas. [...] Some questions: Why httpd is cloned? Are you sure you want an INFINITY stickiness? Are logs saying anything helpful? Anyway, like Nikita said, consider upgrading Heartbeat to version 3. -- RaSca Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene! ra...@miamammausalinux.org http://www.miamammausalinux.org ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
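Putting Andrew's advice together, a minimal sketch in crm shell syntax (names and addresses are hypothetical): clone the httpd resource so every node monitors its own instance, and tie the VIP to the clone so it only runs on a node with a healthy instance:

    primitive p_httpd ocf:heartbeat:apache \
        params configfile="/etc/httpd/conf/httpd.conf" \
        op monitor interval=30s
    primitive p_vip ocf:heartbeat:IPaddr2 \
        params ip="192.168.0.50" cidr_netmask="24" \
        op monitor interval=30s
    clone cl_httpd p_httpd
    colocation col_vip_with_httpd inf: p_vip cl_httpd
    order ord_httpd_before_vip inf: cl_httpd p_vip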
Re: [Linux-HA] need help on email alerts
On Sun, Jun 5, 2011 at 8:53 PM, Amit Jathar amit.jat...@alepo.com wrote: Hi, I have configured email alerts for corosync as follows :- crm configure show ---SNIP- primitive resMON ocf:pacemaker:ClusterMon \ operations $id=resMON-operations \ op monitor interval=180 timeout=20 \ params extra_options=--mail-to x...@gmail.com ---SNIP--- I can see this resource is started :- crm_mon -1 ---SNIP resMON (ocf::pacemaker:ClusterMon): Started xx ---SNIP I can send mail from my machine :- [root@localhost] mail -s testmail xx We don't rely on a local mail server, instead we use libesmtp. You'll need to make sure that is configured - or call mail from a script referenced by --external-agent . Cc: Null message body; hope that's ok I cannot get any mails when my cluster status changes. I could not see anything in /var/log/maillog either. Is there a hint as to which configuration I am missing? Thanks, Amit This email (message and any attachment) is confidential and may be privileged. If you are not certain that you are the intended recipient, please notify the sender immediately by replying to this message, and delete all copies of this message and attachments. Any other use of this email by you is prohibited. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
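A hedged sketch of the --external-agent route Andrew mentions: a small script that shells out to mail, referenced from ClusterMon's extra_options. The CRM_notify_* environment variables and the -E/-e flags are crm_mon's notification interface as I understand it; verify them against the crm_mon man page for your version, and the script path is made up:

    #!/bin/sh
    # /usr/local/bin/crm_notify.sh (hypothetical path) - called by crm_mon for each cluster event.
    # crm_mon is expected to export CRM_notify_* variables describing the event.
    echo "${CRM_notify_node}: ${CRM_notify_rsc} ${CRM_notify_task} -> ${CRM_notify_desc}" \
        | mail -s "Pacemaker event on ${CRM_notify_node}" "${CRM_notify_recipient:-root}"

The ClusterMon resource would then use something like params extra_options="-E /usr/local/bin/crm_notify.sh -e you@example.com" instead of --mail-to, so delivery goes through the local mail command rather than libesmtp.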
Re: [Linux-HA] Virtual mysql cluster ip is not accessible on port 3306
On Thu, Jun 23, 2011 at 12:06 AM, Calistus Che calistus@gmail.com wrote: Hi Guys, could any one of you help me? I just set up a 2 lb (master and slave) and 2 mysql cluster nodes db1 and 2. The servers have 2 interfaces private and public and loadbalancing is running on the private network. Everything is pretty running fine till now, but the only problem access to the virtual ip. I would greatly appreciate your help. Based on what? Regards KC ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Always Get a Billion Failed Actions
On Thu, Jun 16, 2011 at 8:38 PM, Robinson, Eric eric.robin...@psmnv.com wrote: crm_mon on my system displays a lot of failed actions, I guess because the init script for the resource is not fully lsb compliant? In any case, the resources seem to work okay and failover okay. How can I get rid of all those failed actions? This is the cluster detecting that RAs don't exist on those nodes. I think we added some extra logic to 1.1 that hid these when symmetric-cluster=false was specified. crm_mon output follows... Last updated: Thu Jun 16 03:32:32 2011 Stack: Heartbeat Current DC: ha07b.mydomain.com (6080642c-bad3-4bb8-80ba-db6b1f7a0735) - partition with quorum Version: 1.0.9-89bd754939df5150de7cd76835f98fe90851b677 3 Nodes configured, unknown expected votes 4 Resources configured. Online: [ ha07c.mydomain.com ha07b.mydomain.com ha07a.mydomain.com ] Resource Group: g_clust04 p_fs_clust04 (ocf::heartbeat:Filesystem): Started ha07a.mydomain.com p_vip_clust04 (ocf::heartbeat:IPaddr2): Started ha07a.mydomain.com p_mysql_001 (lsb:mysql_001): Started ha07a.mydomain.com p_mysql_230 (lsb:mysql_230): Started ha07a.mydomain.com p_mysql_231 (lsb:mysql_231): Started ha07a.mydomain.com p_mysql_232 (lsb:mysql_232): Started ha07a.mydomain.com p_mysql_233 (lsb:mysql_233): Started ha07a.mydomain.com p_mysql_234 (lsb:mysql_234): Started ha07a.mydomain.com p_mysql_235 (lsb:mysql_235): Started ha07a.mydomain.com p_mysql_236 (lsb:mysql_236): Started ha07a.mydomain.com p_mysql_237 (lsb:mysql_237): Started ha07a.mydomain.com p_mysql_238 (lsb:mysql_238): Started ha07a.mydomain.com p_mysql_239 (lsb:mysql_239): Started ha07a.mydomain.com p_mysql_240 (lsb:mysql_240): Started ha07a.mydomain.com p_mysql_241 (lsb:mysql_241): Started ha07a.mydomain.com p_mysql_242 (lsb:mysql_242): Started ha07a.mydomain.com p_mysql_243 (lsb:mysql_243): Started ha07a.mydomain.com p_mysql_244 (lsb:mysql_244): Started ha07a.mydomain.com p_mysql_245 (lsb:mysql_245): Started ha07a.mydomain.com p_mysql_246 (lsb:mysql_246): Started ha07a.mydomain.com p_mysql_247 (lsb:mysql_247): Started ha07a.mydomain.com p_mysql_248 (lsb:mysql_248): Started ha07a.mydomain.com p_mysql_249 (lsb:mysql_249): Started ha07a.mydomain.com p_mysql_250 (lsb:mysql_250): Started ha07a.mydomain.com p_mysql_251 (lsb:mysql_251): Started ha07a.mydomain.com p_mysql_252 (lsb:mysql_252): Started ha07a.mydomain.com p_mysql_253 (lsb:mysql_253): Started ha07a.mydomain.com p_mysql_254 (lsb:mysql_254): Started ha07a.mydomain.com p_mysql_255 (lsb:mysql_255): Started ha07a.mydomain.com p_mysql_256 (lsb:mysql_256): Started ha07a.mydomain.com p_mysql_257 (lsb:mysql_257): Started ha07a.mydomain.com p_mysql_258 (lsb:mysql_258): Started ha07a.mydomain.com p_mysql_259 (lsb:mysql_259): Started ha07a.mydomain.com p_mysql_260 (lsb:mysql_260): Started ha07a.mydomain.com p_mysql_261 (lsb:mysql_261): Started ha07a.mydomain.com p_mysql_262 (lsb:mysql_262): Started ha07a.mydomain.com p_mysql_263 (lsb:mysql_263): Started ha07a.mydomain.com p_mysql_264 (lsb:mysql_264): Started ha07a.mydomain.com p_mysql_265 (lsb:mysql_265): Started ha07a.mydomain.com p_mysql_266 (lsb:mysql_266): Started ha07a.mydomain.com p_mysql_267 (lsb:mysql_267): Started ha07a.mydomain.com p_mysql_268 (lsb:mysql_268): Started ha07a.mydomain.com p_mysql_269 (lsb:mysql_269): Started ha07a.mydomain.com p_mysql_270 (lsb:mysql_270): Started ha07a.mydomain.com p_mysql_271 (lsb:mysql_271): Started ha07a.mydomain.com p_mysql_272 (lsb:mysql_272): Started ha07a.mydomain.com p_mysql_273 (lsb:mysql_273): Started ha07a.mydomain.com p_mysql_274 
(lsb:mysql_274): Started ha07a.mydomain.com p_mysql_275 (lsb:mysql_275): Started ha07a.mydomain.com p_mysql_276 (lsb:mysql_276): Started ha07a.mydomain.com p_mysql_277 (lsb:mysql_277): Started ha07a.mydomain.com p_mysql_009 (lsb:mysql_009): Started ha07a.mydomain.com p_mysql_021 (lsb:mysql_021): Started ha07a.mydomain.com p_mysql_052
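A hedged sketch of the opt-in arrangement Andrew is referring to, plus cleaning up the failures that have already been recorded (node and resource names follow the thread; the scores are arbitrary):

    # Opt-in cluster: resources may only run where a location constraint allows them
    crm configure property symmetric-cluster=false
    crm configure location loc_clust04_on_a g_clust04 200: ha07a.mydomain.com
    crm configure location loc_clust04_on_b g_clust04 100: ha07b.mydomain.com
    # Clear failed actions already recorded in the status section (repeat per resource)
    crm resource cleanup p_mysql_001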
Re: [Linux-HA] crm_report versus hb_report
Can you run it with -x and send me the screen output please? It should be copying /var/log/syslog to the collector directory before trying to call node_events on it. On Fri, Jun 17, 2011 at 6:04 PM, alain.mou...@bull.net wrote: Hi Andrew, ok thanks. So I tried it and got an error message in the collector script around the node_events call : node_events `basename $logfile` $EVENTS_F it outputs here : grep: syslog: No such file or directory whereas when I trace the code around it : # Parse for events echo $logfile echo $EXTRA_LOGS for l in $logfile $EXTRA_LOGS; do node_events `basename $logfile` $EVENTS_F I get logfile : /var/log/syslog and it does exist : -rw-r--r-- 1 root root 10549069 Jun 17 09:40 /var/log/syslog and the variable EXTRA_LOGS is empty. So the call seems to be : node_events syslog $EVENTS_F So at the end, in the whole report, I get a cluster-log.txt linked to syslog, which does not exist. I tried to substitute the line with : node_events /var/log/syslog $EVENTS_F and it no longer displays the error about grep: syslog: No such file or directory, but I don't know if it is really the right fix (in both cases I got an empty events.txt, but perhaps that is a coincidence). Any idea ? Thanks Alain From: Andrew Beekhof and...@beekhof.net To: General Linux-HA mailing list linux-ha@lists.linux-ha.org Date: 17/06/2011 09:31 Subject: Re: [Linux-HA] crm_report versus hb_report Sent by: linux-ha-boun...@lists.linux-ha.org On Fri, Jun 17, 2011 at 5:30 PM, Andrew Beekhof and...@beekhof.net wrote: On Fri, Jun 17, 2011 at 4:47 PM, alain.mou...@bull.net wrote: Hi, I just discovered that on RH6 there is no longer an hb_report; it has been removed from the cluster-glue rpm. Does the crm_report delivered in the pacemaker rpm give the same results as hb_report ? Yes. It re-uses much of the same gathering code but with a slightly revised design. In fact it's also flag compatible with hb_report. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] heartbeat three node configuration
On Thu, Jun 9, 2011 at 11:54 PM, Ricardo F ri...@hotmail.com wrote: What is the configuration for creating a three-node cluster? Essentially you need Pacemaker on top. haresources-based clusters were only designed for 2 nodes. I have this, but the servers bring up the shared IP at the same time:
ha.cf:
  logfacility local0
  keepalive 2
  deadtime 10
  warntime 5
  initdead 30
  auto_failback off
  ucast bond0 host1 host2 host3
  node host1
  node host2
  node host3
haresources:
  host1 192.168.1.10/24/bond0
I use heartbeat 3.0.3 on Debian squeeze on all of the nodes; all of them have the other nodes' IPs in /etc/hosts and I can propagate the configuration with ha_propagate. Thanks ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
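A hedged sketch of the Pacemaker-on-heartbeat route for three nodes: keep the existing node and communication directives in ha.cf, stop using haresources, and enable the CRM. The directive spelling changed between releases, so check the documentation shipped with your heartbeat 3.0.3 packages:

    # /etc/ha.d/ha.cf (excerpt) - enable Pacemaker instead of haresources-style management
    pacemaker respawn    # older releases spell this "crm respawn"

Resources (the shared IP included) are then defined in the CIB with the crm shell rather than in haresources, which is what makes three or more nodes workable.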
Re: [Linux-HA] ClusterIP clone resource failover and migration issue
On Mon, Jun 6, 2011 at 10:22 PM, Randy Wilson randyedwil...@gmail.com wrote: Hi, I've setup two ClusterIP instances on a two node cluster using the below configuration: node node1.domain.com node node2.domain.com primitive clusterip_33 ocf:heartbeat:IPaddr2 \ params ip=xxx.xxx.xxx.33 cidr_netmask=27 nic=eth0:10 clusterip_hash=sourceip-sourceport-destport mac=01:XX:XX:XX:XX:XX primitive clusterip_34 ocf:heartbeat:IPaddr2 \ params ip=xxx.xxx.xxx.34 cidr_netmask=27 nic=eth0:11 clusterip_hash=sourceip-sourceport-destport mac=01:XX:XX:XX:XX:XX clone clone_clusterip_33 clusterip_33 \ meta globally-unique=true clone-max=2 clone-node-max=2 notify=true target-role=Started \ params resource-stickiness=0 clone clone_clusterip_34 clusterip_34 \ meta globally-unique=true clone-max=2 clone-node-max=2 notify=true target-role=Started \ params resource-stickiness=0 property $id=cib-bootstrap-options \ dc-version=1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd \ cluster-infrastructure=openais \ stonith-enabled=false \ expected-quorum-votes=2 \ last-lrm-refresh=1307352624 The resources start up on each node, with the correct iptables rules being assigned. Last updated: Mon Jun 6 11:29:24 2011 Stack: openais Current DC: node1.domain.com - partition with quorum Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd 2 Nodes configured, 2 expected votes 2 Resources configured. Online: [ node1.domain.com node2.domain.com ] Clone Set: clone_clusterip_33 (unique) clusterip_33:0 (ocf::heartbeat:IPaddr2): Started node1.domain.com clusterip_33:1 (ocf::heartbeat:IPaddr2): Started node2.domain.com Clone Set: clone_clusterip_34 (unique) clusterip_34:0 (ocf::heartbeat:IPaddr2): Started node1.domain.com clusterip_34:1 (ocf::heartbeat:IPaddr2): Started node2.domain.com I receive an error whenever I attempt to migrate one of the resources, so that a single node handles all the ClusterIP traffic. crm(live)resource# migrate clusterip_33:1 node1.domain.com Error performing operation: Update does not conform to the configured schema/DTD You can't (yet) migrate individual instances. Although migrate clusterip_33 node1.domain.com might still do what you want. crm(live)resource# migrate clusterip_34:1 node1.domain.com Error performing operation: Update does not conform to the configured schema/DTD And when one of the nodes is taken offline, by stopping corosync, the resources are stopped on the remaining node and cannot be started without the other node being brought back online. Last updated: Mon Jun 6 12:42:21 2011 Stack: openais Current DC: node1.domain.com - partition WITHOUT quorum Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd 2 Nodes configured, 2 expected votes 2 Resources configured. Online: [ node1.domain.com ] OFFLINE: [ node2.domain.com ] If I add a colocation to the config: colocation coloc_clusterip inf: clone_clusterip_33 clone_clusterip_34 When the offline node is brought back up, all the resources are started on the other node. Last updated: Mon Jun 6 13:00:39 2011 Stack: openais Current DC: node1.domain.com - partition WITHOUT quorum Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd 2 Nodes configured, 2 expected votes 2 Resources configured. 
Online: [ node1.domain.com node2.domain.com ] Clone Set: clone_clusterip_33 (unique) clusterip_33:0 (ocf::heartbeat:IPaddr2): Started node1.domain.com clusterip_33:1 (ocf::heartbeat:IPaddr2): Started node1.domain.com Clone Set: clone_clusterip_34 (unique) clusterip_34:0 (ocf::heartbeat:IPaddr2): Started node1.domain.com clusterip_34:1 (ocf::heartbeat:IPaddr2): Started node1.domain.com Can anyone see where I'm going wrong with this? Many thanks, REW ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
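Regarding the second symptom, resources stopping on the surviving node when the partition is reported as WITHOUT quorum: that is the default quorum behaviour, and two-node clusters usually relax it. A hedged sketch (and note that stonith-enabled=false in the posted configuration should really only be a stop-gap):

    crm configure property no-quorum-policy=ignore
    # once fencing hardware/agents are set up:
    # crm configure property stonith-enabled=true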
Re: [Linux-HA] using the pacemaker logo for the xing group
On Tue, Jun 21, 2011 at 5:22 PM, Keisuke MORI keisuke.mori...@gmail.com wrote: Hi Erkan, As I've sent a personal email to you and as Ikeda-san already replied to you, Anybody may use the logo in conjunction with any Pacemaker / Linux-HA related projects. The logo is a contribution from the Japanese Pacemaker / Linux-HA community, so asking the permission to the Japanese mailing list as you did is right but here is also fine. Although its presence does imply some degree of official connection with the project, so its nice to ask here too :-) You can obtain the logo from here: (sorry it's in Japanese) http://linux-ha.sourceforge.jp/wp/archives/369 Regards, Keisuke MORI Linux-HA Japan Project. 2011/6/21 Junko IKEDA tsukishima...@gmail.com: Hi Erkan, The pacemaker logos has been created by NTT group. I asked for the boss's permission, I think I can send them to you directory soon :) Did you post the similar mail to the Japanese mailing list before this? Sorry to inconvenience you. Thanks, Junko IKEDA NTT DATA INTELLILINK CORPORATION 2011/6/20 erkan yanar erkan.ya...@linsenraum.de: Moin, I would like to use the (red/rabbit) pacemaker logo for the linux cluster group in xing. Who do I have to ask for permission to use it? Regards Erkan -- über den grenzen muß die freiheit wohl wolkenlos sein ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- Keisuke MORI ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] crm_report versus hb_report
On Fri, Jun 17, 2011 at 4:47 PM, alain.mou...@bull.net wrote: Hi, I just discovered that on RH6 there is no longer an hb_report; it has been removed from the cluster-glue rpm. Does the crm_report delivered in the pacemaker rpm give the same results as hb_report ? Yes. It re-uses much of the same gathering code but with a slightly revised design. Are there other tools for traces, or is it sufficient in all cases for a Pacemaker/corosync stack ? Thanks Alain ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] crm_report versus hb_report
On Fri, Jun 17, 2011 at 5:30 PM, Andrew Beekhof and...@beekhof.net wrote: On Fri, Jun 17, 2011 at 4:47 PM, alain.mou...@bull.net wrote: Hi, I just discovered that on RH6 there is no longer an hb_report; it has been removed from the cluster-glue rpm. Does the crm_report delivered in the pacemaker rpm give the same results as hb_report ? Yes. It re-uses much of the same gathering code but with a slightly revised design. In fact it's also flag compatible with hb_report. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Status about the four stack options
On Tue, May 24, 2011 at 9:54 AM, alain.mou...@bull.net wrote: Hi Many thanks for this status. I suppose this is the same status on RHEL6 as Suse is likely to be in advance with regard to RHEL6 Pacemaker corosync evolutions ? This is not implied. On RHEL6, although Pacemaker is not officially supported yet, people are encouraged to use option 2 or 3 due to the improved startup/shutdown reliability. I'd imagine option 4 is about a year or so away. Alain De : Lars Marowsky-Bree l...@suse.de A : General Linux-HA mailing list linux-ha@lists.linux-ha.org Date : 23/05/2011 13:10 Objet : Re: [Linux-HA] Status about the four stack options Envoyé par : linux-ha-boun...@lists.linux-ha.org On 2011-05-23T10:49:16, alain.mou...@bull.net wrote: Hi I just wonder the status of the 4 stack options : from which releases of Pacemaker corosync are the 3 and 4 options available ? and on which Distribution ? RHEL6 ? 1. corosync + pacemaker plugin (v0) This is what SUSE Linux Enterprise High-Availability Extension uses, and is fully supported. 2. corosync + pacemaker plugin (v1) + mcp We may switch to this at a later time during the SLE HA cycle. 3. corosync + cpg + cman + mcp 4. corosync + cpg + quorumd + mcp On SLE, we're bound to skip 3, but 4) is probably somewhere in the very late future, once it is fully stabilized and integrated. Regards, Lars -- Architect Storage/HA, OPS Engineering, Novell, Inc. SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg) Experience is the name everyone gives to their mistakes. -- Oscar Wilde ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] ocfs2
On Tue, May 24, 2011 at 2:33 PM, Eric Warnke ewar...@albany.edu wrote: Fedora 14 lacks dlm-pcmk since it has been deprecated. Really frustrating, as whatprovides shows a file but yum install says nothing to do without installing. Most of the existing quick start docs are therefore inapplicable, as they presume that you have dlm_controld.pcmk. https://www.redhat.com/archives/linux-cluster/2011-March/msg00084.html Overall I went back over a number of steps, found some errors, and was able to get it up and running. You need to use pacemaker + cman and the regular *_controld daemons. See the 1.1 version of clusters from scratch up at clusterlabs. 1) Went back over and reconfigured cman + pacemaker without corosync 2) Since I had presumed that I would be integrating with pacemaker, I had failed to install the ocfs2-tools-cman package 3) Somewhere along the way I set up the cluster.conf with the full hostname, leading to all sorts of fun with pacemaker listing 6 nodes rather than three. crm configure erase nodes was able to clear that up once the cluster.conf files were stable. 4) Once those two things were stable I was able to bring up o2cb and ocfs2 clones under pacemaker ( my understanding is dlm is already up thanks to cman ). At this point I'll probably have to take a step back and try rebuilding this cluster to make sure I have the flow right. Am I correct in presuming that, short of membership and quorum in cman, pacemaker is where I configure STONITH and obviously all services? Cheers, Eric On 5/23/11 4:18 PM, asimonell...@gmail.com asimonell...@gmail.com wrote: I found the following link extremely useful for setting up OCFS with OpenAIS/Corosync: http://www.novell.com/documentation/sle_ha -Anthony --Original Message-- From: Eric Warnke Sender: linux-ha-boun...@lists.linux-ha.org To: Linux-HA mailing list ReplyTo: General Linux-HA mailing list Subject: [Linux-HA] ocfs2 Sent: May 23, 2011 3:13 PM I have been chasing my tail all day trying to get a simple 3 node cluster to mount an ocfs2 filesystem over iscsi on Fedora 14. Up until this morning it was working wonderfully for testing HA NFSv4, where the filesystems were non-clustered xfs volumes. Is there any useful documentation for converting a simple corosync + pacemaker installation to being able to mount an ocfs2 filesystem? -Eric ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems Sent via BlackBerry from T-Mobile ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
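For anyone following along, the pacemaker side of such a cman-based ocfs2 setup often ends up as little more than an interleaved Filesystem clone (plus, depending on the distribution, an o2cb clone ordered before it); the controld daemons come from cman, and fencing for pacemaker-managed resources is configured in pacemaker as usual. A rough sketch with made-up device and directory names:

    primitive p_fs_ocfs2 ocf:heartbeat:Filesystem \
        params device="/dev/sdb1" directory="/srv/cluster" fstype="ocfs2" \
        op monitor interval=20s timeout=40s
    clone cl_fs_ocfs2 p_fs_ocfs2 meta interleave=true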
Re: [Linux-HA] DO NOT start using heartbeat 2.x in crm mode, but just use Pacemaker, please! [was: managing resource httpd in heartbeat]
On Thu, May 19, 2011 at 12:36 PM, Lars Ellenberg lars.ellenb...@linbit.com wrote: On Wed, May 18, 2011 at 05:21:35PM -0700, Vinay Nagrik wrote: Hello everybody, I am running Centos 5.2 with heartbeat 2.1.3 and we as a group run it on appliances and *it is readily not possible to suddenly upgrade heartbeat to a later version which runs pacemaker*. Oh yes you can. If you want to use the cib based crmd style configuration, you will have to. Why do you think you can not? You cannot change the facts by asking the same question again. But if it helps to get the message across, please, anyone that wants to, feel free to add Confirmed by The guy that wrote most of the stuff being discussed. Seriously, are we still having this conversation? Its two years since I wrote: http://www.mail-archive.com/linux-ha@lists.linux-ha.org/msg12684.html and its even more true now. to this thread. ;-) My requirement as a developer to explore possibilities to manage resource especially Apache web server in such a fashion that on a two node cluster if *on the active node the Apache service goes down then* the heartbeat should shut down that active node and transfer the control to other node, which has the Apache serviceable.. Could someone please tell me the literature, where I can get some configuration parameters. I know it can be done in a cib.xml file, which is part of heartbeat 2.1.3 and also some other utilities like cibadmin. If you, for what ever (non technical) reason you think you are stuck with heartbeat 2.1.x, stay with haresources, possibly add mon into your setup, and hope for the best. If you want to go cib and crm, you absolutely have to use Pacemaker. Whether you then use Pacemaker on top of heartbeat (3.0.x) or corosync is an other decision, and unaffected by this. But please accept that Pacemaker is the very same CRMD (cluster resource manager daemon) that came with heartbeat 2.1.x, only four (4!) years of bug fixing and development later. So if you insist on using some known to be buggy 4 year old piece of software, just because one component of it now goes by a different name, I'm sorry, but you should not expect much help. I think we told you that already? Still, you asked to be pointed to documentation. There is the section Documentation For Older Releases http://www.clusterlabs.org/wiki/Documentation Which links to some relevant docs about 2.1.x. Just in case you absolutely love to hurt yourself ;-) -- : Lars Ellenberg : LINBIT | Your Way to High Availability : DRBD/HA support and consulting http://www.linbit.com DRBD® and LINBIT® are registered trademarks of LINBIT, Austria. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Best way for colocating resource on a dual primary drbd
On Sat, May 14, 2011 at 9:31 AM, RaSca ra...@miamammausalinux.org wrote: Il giorno Ven 13 Mag 2011 16:09:14 CET, Viacheslav Biriukov ha scritto: In your case you have two drbd master. So, I think, it is not a good idea to create that collocation. Instead of this you can set location directives to locate vm-test_virtualdomain where you want to be default. For example: location L_vm-test_virtualdomain_01 vm-test_virtualdomain 100: master1.node location L_vm-test_virtualdomain_02 vm-test_virtualdomain 10: master2.node And I agree to your point of view (since I test that the colocation is not working). But the point is: why? I mean, the colocation defines that the drbd device must run in a node where drbd is Master. Why Pacemaker puts drbd in slave on the node in which the migration start? Does a colocation like this: colocation vm-test_virtualdomain_ON_vm-test_ms-r0 inf: vm-test_virtualdomain vm-test_ms-r0:Master Implicit that once the resource go away it becomes slave? Pretty sure this is a bug in 1.0. Have you tried 1.1.5 ? ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
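For completeness, the usual companion to that colocation is an ordering on the promote action, so the VM only starts once DRBD has been promoted on that node; a sketch using the names from this thread:

    colocation vm-test_virtualdomain_ON_vm-test_ms-r0 inf: vm-test_virtualdomain vm-test_ms-r0:Master
    order vm-test_virtualdomain_AFTER_vm-test_ms-r0 inf: vm-test_ms-r0:promote vm-test_virtualdomain:start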
Re: [Linux-HA] Behaviour when rebooting inactive node
On Mon, May 9, 2011 at 2:38 PM, Nicolas Guenette nicol...@ho.reitmans.com wrote: Hello, I have a two node cluster and a question about the cluster's behaviour when I reboot the inactive node. The situation is this: if the resources are running on serverA and I reboot serverB, serverA re-acquires the resources when it detects serverB leaving. Meaning, it actually re-runs my startup scripts! That's not good... Is there any way to configure Linux-HA so that it doesn't behave like that when the other node is rebooted? How can I stop this re-acquiring of resources? Hard to say without some idea of what version you're running and what your config looks like. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-ha-dev] Filesystem ocf file
On Fri, May 6, 2011 at 9:37 AM, Florian Haas florian.h...@linbit.com wrote: On 2011-05-06 09:26, Darren Thompson wrote: Team I was reviewing some errors on a cluster mounted file-system that caused me to review the Filesystem ocf file. I notice that it uses an undeclared parameter of OCF_CHECK_LEVEL to determine what degree of testing of the filesystem is required in monitor I have now updated it to more formally work with a check_level value with the more obvious values of mounted, read write ( my updated version attached ) Could someone (Florian is this something you can do?) please review this with a view to patching the upstream Filesystem ocf file. NACK, sorry. The OCF_CHECK_LEVEL is specific to the monitor action and described as such in the OCF spec; this will not be changed without a change to the spec. To use it, set op monitor interval=X OCF_CHECK_LEVEL=Y Yes, it's poorly designed, it makes no sense why this is pretty much the only sensible time to set a parameter specifically for an operation (as opposed to on a resource), it's inexplicable why it's all caps, etc., but that's the way it is. Honest. It was broken when we got here. Maybe it was the neighbor's dog? ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
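A short sketch of Florian's suggestion in crm shell syntax, with made-up device names; the Filesystem agent's monitor reads OCF_CHECK_LEVEL, where (as I understand the agent) 10 adds a read test and 20 a write test on top of the plain mount check:

    primitive p_fs ocf:heartbeat:Filesystem \
        params device="/dev/vg01/lv01" directory="/srv/data" fstype="ext3" \
        op monitor interval=60s timeout=60s OCF_CHECK_LEVEL=20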
Re: [Linux-ha-dev] [ha-wg] Cluster Stack - Ubuntu Developer Summit
On Thu, May 5, 2011 at 10:25 AM, Florian Haas florian.h...@linbit.com wrote: On 2011-04-26 19:33, Andres Rodriguez wrote: UDS' are open-to-public events, and I believe it would be great if upstream could participate and maybe even further the discussion about the Cluster Stack. For more information about UDS, please visit [1]. The specific date/time for the Cluster Stack session is not yet available. If you require any further information please don't hesitate to contact me. Andres already knows this, but FWIW I'll repost here that I'll be at UDS in time for the cluster stack session at 12 noon on 5/12. I'll stay in Budapest that evening and will probably join the Budapest sightseeing tour that the Hungarian Ubuntu team is organizing, so if anyone wants to link up with Andres and me for a few beverages please let us know. Andrew, interested in making a day trip to Budapest while you're still on this continent? With under 4 weeks to go - not a chance :-) ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] ACLs and privilege escalation (was Re: New OCF RA: symlink)
On Thu, May 5, 2011 at 9:09 AM, Florian Haas florian.h...@linbit.com wrote: Rather than going into ACLs in more detail, I wanted to highlight that however we limit access to the CIB, the resource agents still _execute_ as root, so we will always have what would normally be considered a privilege escalation issue. Now, we could agree on security guidelines for RAs, and some of those would certainly be no-brainers to define (such as, don't ever eval unsanitized user input), but I refuse to even suggest to tackle any such guidelines before the OCF spec update has gotten off the ground. One such thing that could be added to the spec would be optional meta variables named user and group, directing the LRM (or any successor) to execute the RA as that user rather than root. Just an idea. Seems plausible. ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] New OCF RA: symlink
On Wed, May 4, 2011 at 4:36 PM, Lars Ellenberg lars.ellenb...@linbit.com wrote: Services running under Pacemaker control are probably critical, so a malicious person with even only stop access on the CIB can do a DoS. I guess we have to assume people with any write access at all to the CIB are trusted, and not malicious. Exactly. If the cluster (or access to it) has been compromised, you're in for so much pain that a symlink RA is the least of your problems. A generic cluster manager is, by design, a way to run arbitrary scripts as root - there's no coming back from there. ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-HA] get haresources2cib.py
On Tue, May 3, 2011 at 9:50 PM, Vinay Nagrik vnag...@gmail.com wrote: Hello Andrew, We have been going into small details and I still did not get an answer that puts me on the right path. I apologise for asking you these questions, but they are important for my work. I have downloaded Heartbeat-3-0-STABLE-3.0.4.tar.bz2 heartbeat != pacemaker and unzipped it. I looked for the crm shell and any .dtd/.DTD file and did not find any. Please tell me where to get the crm shell, what the steps are, or whether I downloaded the wrong .tar.bz2 file. There were these files as well: glue-1.0.7.tar.bz2 http://hg.linux-ha.org/glue/archive/glue-1.0.7.tar.bz2 and agents-1.0.4.tar.gz Do I have to download these files also? My first and very first step is to create a cib.xml file, and I am running in small circles. Kindly help. I will greatly appreciate this. Thanks. arun On Mon, May 2, 2011 at 11:16 PM, Andrew Beekhof and...@beekhof.net wrote: On Mon, May 2, 2011 at 9:33 PM, Vinay Nagrik vnag...@gmail.com wrote: Thank you Andrew. Could you please tell me where to get the DTD for cib.xml and where I can download the crm shell from. Both get installed with the rest of pacemaker thanks in anticipation. With best regards. nagrik On Mon, May 2, 2011 at 12:56 AM, Andrew Beekhof and...@beekhof.net wrote: On Sun, May 1, 2011 at 9:26 PM, Vinay Nagrik vnag...@gmail.com wrote: Dear Andrew, I read your document clusters from scratch and found it very detailed. It gave lots of information, but I was looking for how to create a cib.xml and could not decipher the language as to the syntax and the different fields to be put in cib.xml. Don't look at the xml. Use the crm shell. I am still looking for the haresources2cib.py script. Don't. It only creates configurations conforming to the older and now unsupported syntax. I searched the web but could not find it anywhere. I have 2 more questions. Do I have to create the cib.xml file on the nodes where I am running the heartbeat v.2 software? Does cib.xml have to reside in the /var/lib/crm directory, or can it reside anywhere else? Kindly provide these answers. I will greatly appreciate your help. Have a nice day. Thanks. nagrik On Sat, Apr 30, 2011 at 1:32 AM, Andrew Beekhof and...@beekhof.net wrote: Forget the conversion. Use the crm shell to create one from scratch. And look for the clusters from scratch doc relevant to your version - its worth the read. On Sat, Apr 30, 2011 at 1:19 AM, Vinay Nagrik vnag...@gmail.com wrote: Hello Group, Kindly tell me where I can download the haresources2cib.py file from. Please also tell me whether I can convert the haresources file on a node where I am not running the high availability service and then copy the converted .xml file into the /var/lib/heartbeat directory on the node where I am running high availability. Also, must the cib file reside under the /var/lib/heartbeat directory, or can it reside under any other directory, such as /etc? Please let me know. I am just a beginner. Thanks in advance.
-- Thanks Nagrik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
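For readers in the same position, a minimal "crm shell instead of haresources2cib.py" sketch, with made-up addresses and paths; it produces a valid CIB without ever editing the XML by hand:

    crm configure property stonith-enabled=false    # for initial testing only
    crm configure primitive p_ip ocf:heartbeat:IPaddr2 \
        params ip="192.168.1.50" cidr_netmask="24" op monitor interval=30s
    crm configure primitive p_web ocf:heartbeat:apache \
        params configfile="/etc/httpd/conf/httpd.conf" op monitor interval=60s
    crm configure group g_web p_ip p_web

The resulting configuration is stored and replicated by the cluster itself (under /var/lib/heartbeat/crm on heartbeat-based installs); there is no need to hand-place a cib.xml.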
Re: [Linux-HA] Filesystem does not start on Pacemaker cluster
On Tue, May 3, 2011 at 10:39 AM, KoJack kojac...@web.de wrote: Hi, i was trying to set up a pacemaker cluster. After I added all resources, the filesystem will not start at one Node. crm_verify -L -V crm_verify[30068]: 2011/05/03_10:35:39 WARN: unpack_rsc_op: Processing failed op WebFS:0_start_0 on apache01: unknown error (1) crm_verify[30068]: 2011/05/03_10:35:39 WARN: unpack_rsc_op: Processing failed op WebFS:0_stop_0 on apache01: unknown exec error (-2) crm_verify[30068]: 2011/05/03_10:35:39 WARN: common_apply_stickiness: Forcing WebFSClone away from apache01 after 100 failures (max=100) crm_verify[30068]: 2011/05/03_10:35:39 WARN: common_apply_stickiness: Forcing WebFSClone away from apache01 after 100 failures (max=100) crm_verify[30068]: 2011/05/03_10:35:39 WARN: common_apply_stickiness: Forcing WebFSClone away from apache01 after 100 failures (max=100) crm_verify[30068]: 2011/05/03_10:35:39 ERROR: clone_rsc_colocation_rh: Cannot interleave clone WebSiteClone and WebIP because they do not support the same number of resources per node crm_verify[30068]: 2011/05/03_10:35:39 ERROR: clone_rsc_colocation_rh: Cannot interleave clone WebSiteClone and WebIP because they do not support the same number of resources per node crm_verify[30068]: 2011/05/03_10:35:39 WARN: should_dump_input: Ignoring requirement that WebFS:0_stop_0 comeplete before WebFSClone_stopped_0: unmanaged failed resources cannot prevent clone shutdown Errors found during check: config not valid crm configure show node apache01 node apache02 primitive ClusterIP ocf:heartbeat:IPaddr2 \ params ip=10.1.1.5 cidr_netmask=8 nic=eth0 clusterip_hash=sourceip \ op monitor interval=30s primitive WebData ocf:linbit:drbd \ params drbd_resource=wwwdata \ op monitor interval=60s \ op start interval=0 timeout=240s \ op stop interval=0 timeout=100s primitive WebFS ocf:heartbeat:Filesystem \ params device=/dev/drbd/by-res/wwwdata directory=/var/www/html fstype=gfs2 \ op start interval=0 timeout=60s \ op stop interval=0 timeout=60s primitive WebSite ocf:heartbeat:apache \ params configfile=/etc/httpd/conf/httpd.conf \ op monitor interval=1min \ op start interval=0 timeout=40s \ op stop interval=0 timeout=60s primitive dlm ocf:pacemaker:controld \ op monitor interval=120s \ op start interval=0 timeout=90s \ op stop interval=0 timeout=100s primitive gfs-control ocf:pacemaker:controld \ params daemon=gfs_controld.pcmk args=-g 0 \ op monitor interval=120s \ op start interval=0 timeout=90s \ op stop interval=0 timeout=100s ms WebDataClone WebData \ meta master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 notify=true clone WebFSClone WebFS clone WebIP ClusterIP \ meta globally-unique=true clone-max=2 clone-node-max=2 clone WebSiteClone WebSite clone dlm-clone dlm \ meta interleave=true clone gfs-clone gfs-control \ meta interleave=true colocation WebFS-with-gfs-control inf: WebFSClone gfs-clone colocation WebSite-with-WebFS inf: WebSiteClone WebFSClone colocation fs_on_drbd inf: WebFSClone WebDataClone:Master colocation gfs-with-dlm inf: gfs-clone dlm-clone colocation website-with-ip inf: WebSiteClone WebIP order WebFS-after-WebData inf: WebDataClone:promote WebFSClone:start order WebSite-after-WebFS inf: WebFSClone WebSiteClone order apache-after-ip inf: WebIP WebSiteClone order start-WebFS-after-gfs-control inf: gfs-clone WebFSClone order start-gfs-after-dlm inf: dlm-clone gfs-clone property $id=cib-bootstrap-options \ dc-version=1.1.4-ac608e3491c7dfc3b3e3c36d966ae9b016f77065 \ cluster-infrastructure=openais \ expected-quorum-votes=2 \ 
stonith-enabled=false \ no-quorum-policy=ignore rsc_defaults $id=rsc-options \ resource-stickiness=100 Did you see any mistake in my configuration? You mean apart from the 2 errors and the apache resource that cant stop in the crm_verify output? Thanks a lot -- View this message in context: http://old.nabble.com/Filesystem-do-not-start-on-Pacemaker-Cluster-tp31530410p31530410.html Sent from the Linux-HA mailing list archive at Nabble.com. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
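On the "Cannot interleave clone WebSiteClone and WebIP" errors: interleaving requires both clones in the colocation to allow the same number of instances per node, and here WebIP has clone-node-max=2 while WebSiteClone uses the default of 1. One way to make them match is sketched below, but note it changes how ClusterIP load-sharing behaves on node failure, so treat it as a starting point rather than the fix; the cleanup command clears the recorded WebFS start failure once the underlying gfs2 problem is sorted out:

    clone WebIP ClusterIP \
        meta globally-unique="true" clone-max="2" clone-node-max="1"
    # after fixing the filesystem problem on apache01:
    crm resource cleanup WebFSClone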
Re: [Linux-HA] get haresources2cib.py
On Mon, May 2, 2011 at 9:33 PM, Vinay Nagrik vnag...@gmail.com wrote: Thank you Andrew. Could you please tell me where to get the DTD for cib.xml and where from can I download crm shell. Both get installed with the rest of pacemaker thanks in anticipation. With best regards. nagrik On Mon, May 2, 2011 at 12:56 AM, Andrew Beekhof and...@beekhof.net wrote: On Sun, May 1, 2011 at 9:26 PM, Vinay Nagrik vnag...@gmail.com wrote: Dear Andrew, I read your document clusters from scratch and found it very detailed. It gave lots of information, but I was looking for creating a cib.xml and could not decipher the language as to the syntex and different fields to be put in cib.xml. Don't look at the xml. Use the crm shell. I am still looking for the haresources2cib.py script. Don't. It only creates configurations conforming to the older and now unsupported syntax. I searched the web but could not find anywhere. I have 2 more questions. Do I have to create the cib.xml file on the nodes I am running heartbeat v.2 software. Does cib.xml has to reside in /var/lib/crm directory or can it reside anywhere else. Kindly provide these answers. I will greatly appreciate your help. Have a nice day. Thanks. nagrik On Sat, Apr 30, 2011 at 1:32 AM, Andrew Beekhof and...@beekhof.net wrote: Forget the conversion. Use the crm shell to create one from scratch. And look for the clusters from scratch doc relevant to your version - its worth the read. On Sat, Apr 30, 2011 at 1:19 AM, Vinay Nagrik vnag...@gmail.com wrote: Hello Group, Kindly tell me where can I download haresources2cib.py file from. Please also tell me can I convert haresources file on a node where I am not running high availability service and then can I copy the converted ..xml file in /var/lib/heartbeat directory on which I am running the high availability. Also does cib file must resiede under /var/lib/heartbeat directory or can it reside under any directory like under /etc. please let me know. I am just a beginner. Thanks in advance. -- Thanks Nagrik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- Thanks Nagrik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- Thanks Nagrik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Antw: Re: ocf:pacemaker:ping: dampen
On Mon, May 2, 2011 at 5:29 PM, Lars Ellenberg lars.ellenb...@linbit.com wrote: On Mon, May 02, 2011 at 04:04:56PM +0200, Andrew Beekhof wrote: Still, we may get a spurious failover in this case: reachability: +__ Node A monitoring intervals: + - + + + - - - - - Node B monitoring intervals: + + - + + - - - - - dampening interval: |-| Note how the dampening helps to ignore the first network glitch. But for the permanent network problem, we may get spurious failover: Then your dampen setting is too short or interval too long :-) No. Regardless of dampen and interval setting. Unless both nodes notice the change at the exact same time, expire their dampen at the exact same time, This is where you've diverged. Once dampen expires on one node, _all_ nodes write their current value. and place their updated values into the CIB at exactly the same time. If a ping node just dies, then one node will always notice it first. And regardless of dampen and interval settings, one will reach the CIB first, and therefor the PE will see the connectivity change first for only one of the nodes, and only later for the other (once it noticed, *and* expired its dampen interval, too). Show me how you can work around that using dampen or interval settings. -- : Lars Ellenberg : LINBIT | Your Way to High Availability : DRBD/HA support and consulting http://www.linbit.com ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] get haresources2cib.py
On Sun, May 1, 2011 at 9:26 PM, Vinay Nagrik vnag...@gmail.com wrote: Dear Andrew, I read your document clusters from scratch and found it very detailed. It gave lots of information, but I was looking for creating a cib.xml and could not decipher the language as to the syntex and different fields to be put in cib.xml. Don't look at the xml. Use the crm shell. I am still looking for the haresources2cib.py script. Don't. It only creates configurations conforming to the older and now unsupported syntax. I searched the web but could not find anywhere. I have 2 more questions. Do I have to create the cib.xml file on the nodes I am running heartbeat v.2 software. Does cib.xml has to reside in /var/lib/crm directory or can it reside anywhere else. Kindly provide these answers. I will greatly appreciate your help. Have a nice day. Thanks. nagrik On Sat, Apr 30, 2011 at 1:32 AM, Andrew Beekhof and...@beekhof.net wrote: Forget the conversion. Use the crm shell to create one from scratch. And look for the clusters from scratch doc relevant to your version - its worth the read. On Sat, Apr 30, 2011 at 1:19 AM, Vinay Nagrik vnag...@gmail.com wrote: Hello Group, Kindly tell me where can I download haresources2cib.py file from. Please also tell me can I convert haresources file on a node where I am not running high availability service and then can I copy the converted ..xml file in /var/lib/heartbeat directory on which I am running the high availability. Also does cib file must resiede under /var/lib/heartbeat directory or can it reside under any directory like under /etc. please let me know. I am just a beginner. Thanks in advance. -- Thanks Nagrik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- Thanks Nagrik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Antw: Re: ocf:pacemaker:ping: dampen
On Mon, May 2, 2011 at 8:27 AM, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: Andrew Beekhof and...@beekhof.net schrieb am 29.04.2011 um 09:31 in Nachricht BANLkTi=-ftyk9uxcgu0m2wqhquu_rt8...@mail.gmail.com: On Fri, Apr 29, 2011 at 9:27 AM, Dominik Klein d...@in-telegence.net wrote: It waits $dampen before changes are pushed to the cib. So that eventually occuring icmp hickups do not produce an unintended failover. At least that's my understanding. correcto Hi! Strange: So the update is basically just delayed by that amount of time? I see no advantage: If you put a bad value to the CIB immediately or after some delay, the value won't get better by that. Damping siggests some filtering to me, but you are saying your are not filtering the values, but just delaying them. Right? Only the current value is written. So the cluster will tolerate minor outages provided they last for less than the dampen interval and the monitor frequency is high enough. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
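[Editorial note] To make the timing relationship concrete, a ping resource whose monitor runs several times inside the dampen window might look like the sketch below (the address, multiplier and intervals are placeholders, not taken from the thread):

    primitive p_ping ocf:pacemaker:ping \
        params host_list="192.168.1.1" multiplier=1000 dampen=30s \
        op monitor interval=10s timeout=60s
    clone cl_ping p_ping

With a 10s monitor and a 30s dampen, an outage has to persist across roughly three monitor runs before the lowered value is actually pushed to the CIB; a single missed ping is simply overwritten by the next good reading before the dampen timer expires.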
Re: [Linux-HA] Antw: Re: ocf:pacemaker:ping: dampen
On Mon, May 2, 2011 at 3:51 PM, Lars Ellenberg lars.ellenb...@linbit.com wrote: On Mon, May 02, 2011 at 01:20:16PM +0200, Andrew Beekhof wrote: On Mon, May 2, 2011 at 8:27 AM, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: Andrew Beekhof and...@beekhof.net schrieb am 29.04.2011 um 09:31 in Nachricht BANLkTi=-ftyk9uxcgu0m2wqhquu_rt8...@mail.gmail.com: On Fri, Apr 29, 2011 at 9:27 AM, Dominik Klein d...@in-telegence.net wrote: It waits $dampen before changes are pushed to the cib. So that eventually occuring icmp hickups do not produce an unintended failover. At least that's my understanding. correcto Hi! Strange: So the update is basically just delayed by that amount of time? I see no advantage: If you put a bad value to the CIB immediately or after some delay, the value won't get better by that. Damping siggests some filtering to me, but you are saying your are not filtering the values, but just delaying them. Right? Only the current value is written. So the cluster will tolerate minor outages provided they last for less than the dampen interval and the monitor frequency is high enough. Still, we may get a spurious failover in this case: reachability: +__ Node A monitoring intervals: + - + + + - - - - - Node B monitoring intervals: + + - + + - - - - - dampening interval: |-| Note how the dampening helps to ignore the first network glitch. But for the permanent network problem, we may get spurious failover: Then your dampen setting is too short or interval too long :-) One dampening interval after node B notices loss of reachability, it will trigger a PE run, potentially moving things from B to A, because on A, the reachability (in the CIB) is still ok. Shortly thereafter, the dampening interval on A also expires, and the CIB will be updated with A cannot reach out there either. Any resource migrations triggered by B cannot reach out there are now recognized as spurious. Question is, how could we avoid them? ipfail used to ask the peer, wait for the peer to notice the new situation as well, and only then trigger actions. We could possibly store a short history of values, and actually do some filtering arithmetic with them. Not sure if this should be done inside or outside of the CIB. Probably outside. Yes, outside :-) One of these attrd needs a rewrite :-( ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] get haresources2cib.py
Forget the conversion. Use the crm shell to create one from scratch. And look for the clusters from scratch doc relevant to your version - its worth the read. On Sat, Apr 30, 2011 at 1:19 AM, Vinay Nagrik vnag...@gmail.com wrote: Hello Group, Kindly tell me where can I download haresources2cib.py file from. Please also tell me can I convert haresources file on a node where I am not running high availability service and then can I copy the converted ..xml file in /var/lib/heartbeat directory on which I am running the high availability. Also does cib file must resiede under /var/lib/heartbeat directory or can it reside under any directory like under /etc. please let me know. I am just a beginner. Thanks in advance. -- Thanks Nagrik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] ocf:pacemaker:ping: dampen
On Fri, Apr 29, 2011 at 9:27 AM, Dominik Klein d...@in-telegence.net wrote: It waits $dampen before changes are pushed to the CIB, so that occasional ICMP hiccups do not produce an unintended failover. At least that's my understanding. correcto ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Pingd does not react as expected = split brain
On Wed, Apr 27, 2011 at 7:18 PM, Stallmann, Andreas astallm...@conet.de wrote: Hi Andrew, According to your configuration, it can be up to 60s before we'll detect a change in external connectivity. Thats plenty of time for the cluster to start resources. Maybe shortening the monitor interval will help you. TNX for the suggestion, I'll try that. Any suggestions on recommended monitor intervals for pingd? Couldn't hurt. Hm... if I - for example, set the monitor interval to 10s, I'd have to adjust the timeout for monitor to 10s as well, right? Right. Ping is quite sluggish, it takes up to 30s to check the three nodes. Sounds like something is misconfigured. If I now adust the interval to 10s, the next check might be triggered before the last one is complete. Will this confuse pacemaker? No. The next op will happen 10s after the last finishes. Yes, and there is no proper way to use DRBD in a three node cluster. How is one related to the other? No-one said the third node had to run anything. Ok, thanks for the info; I thought all members of the cluster had to be able to run cluster resources. I would have to keep resources from trying to run on the third node then via a location constrain, right? Or node standby. TNX for your input! Andreas CONET Solutions GmbH, Theodor-Heuss-Allee 19, 53773 Hennef. Registergericht/Registration Court: Amtsgericht Siegburg (HRB Nr. 9136) Geschäftsführer/Managing Directors: Jürgen Zender (Sprecher/Chairman), Anke Höfer ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
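[Editorial note] For the third, quorum-only node, either approach mentioned above fits in one line; a sketch with placeholder group and node names:

    location no-resources-on-arbitrator g_myservices -inf: node3
    # or simply park the node:
    crm node standby node3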
Re: [Linux-ha-dev] Translate crm_cli.txt to Japanese
On Wed, Apr 27, 2011 at 12:54 PM, Dejan Muhamedagic de...@suse.de wrote: Hi Junko-san, On Wed, Apr 27, 2011 at 06:42:52PM +0900, Junko IKEDA wrote: Hi, May I suggest that you go with the devel version, because crm_cli.txt was converted to crm.8.txt. There are not many textual changes, just some obsolete parts removed. OK, I got crm.8.txt from devel. Each directory structure for Pacemaker 1.0,1.1 and devel is just a bit different. Does 1.0 keep its doc dir structure for now? Until the next release I guess. If so, it seems that just create html file is not so difficult when asciidoc is available. No, not difficult. It just depends on the build environment. If asciidoc is found by configure, then it is going to be used to produce the html files. Do any distros _not_ ship asciidoc? ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] Translate crm_cli.txt to Japanese
On Wed, Apr 27, 2011 at 3:47 PM, Dejan Muhamedagic de...@suse.de wrote: On Wed, Apr 27, 2011 at 02:01:40PM +0200, Andrew Beekhof wrote: On Wed, Apr 27, 2011 at 12:54 PM, Dejan Muhamedagic de...@suse.de wrote: Hi Junko-san, On Wed, Apr 27, 2011 at 06:42:52PM +0900, Junko IKEDA wrote: Hi, May I suggest that you go with the devel version, because crm_cli.txt was converted to crm.8.txt. There are not many textual changes, just some obsolete parts removed. OK, I got crm.8.txt from devel. Each directory structure for Pacemaker 1.0,1.1 and devel is just a bit different. Does 1.0 keep its doc dir structure for now? Until the next release I guess. If so, it seems that just create html file is not so difficult when asciidoc is available. No, not difficult. It just depends on the build environment. If asciidoc is found by configure, then it is going to be used to produce the html files. Do any distros _not_ ship asciidoc? AFAIK none of contemporary distributions. And going back three years or so, it's the other way around. How quickly we forget. Anyway, I advocate that the project makes decisions based on it being around (but fails gracefully when its not) and leaves it up to older distros to ship a pre-generated copy if they so desire. I can't imagine lack of HTML versions being a deal breaker. And by fail gracefully, I mean the current behavior of just not building those versions of the doc. ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-HA] Problem using Stonith external/ipmi device
On Tue, Apr 26, 2011 at 9:07 PM, Dejan Muhamedagic deja...@fastmail.fm wrote: On Tue, Apr 19, 2011 at 02:46:06PM +0200, Andrew Beekhof wrote: On Tue, Apr 19, 2011 at 12:43 PM, Dejan Muhamedagic deja...@fastmail.fm wrote: Hi, On Mon, Apr 11, 2011 at 09:41:12AM +0200, Andrew Beekhof wrote: On Fri, Apr 8, 2011 at 11:07 AM, Matthew Richardson m.richard...@ed.ac.uk wrote: On 07/04/11 16:36, Dejan Muhamedagic wrote: For whatever reason stonith-ng doesn't think that stonithipmidisk1 can manage this node. Which version of Pacemaker do you run? Perhaps this has been fixed in the meantime. I cannot recall right now if there has been such a problem, but it's possible. You can also try to turn debug on and see if there are more clues. I'm using Pacemaker 1.1.5 from the clusterlabs rpm-next repositories on el5. I've tried turning on debug, but there's no more information coming out in the logs. man stonithd has the bits you need. start with pcmk_host_check That defaults to dynamic-list which should query the resource. Right? Right. Apparently, something's not quite ok there. the list command doesn't work perhaps? Yes, it does work. And it's been working since forever, as you know. I'm not sure how I would know this, I've never used an ipmi device. Unless there's something wrong with the installation. Whatever happened here? Matthew? Thanks, Dejan BTW, I've been doing tests with external/ssh and it did work fine. also fine with fence_xvm ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
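[Editorial note] If the dynamic-list query is misbehaving, one common workaround is to bypass it with a static host list. A sketch of what that can look like for external/ipmi (all parameter values are placeholders; adjust to your device):

    primitive st_ipmi_node1 stonith:external/ipmi \
        params hostname=node1 ipaddr=10.0.0.11 userid=admin passwd=secret interface=lan \
               pcmk_host_check=static-list pcmk_host_list=node1 \
        op monitor interval=60m
    location st_ipmi_node1_placement st_ipmi_node1 -inf: node1

The location constraint keeps the device that fences node1 from running on node1 itself.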
Re: [Linux-HA] Pingd does not react as expected = split brain
On Wed, Apr 27, 2011 at 11:52 AM, Stallmann, Andreas astallm...@conet.de wrote: Hi Lars, Hi Lars! You are exercising complete cluster communication loss. Which is cluster split brain. Correct, yes. If you are specifically exercising cluster split brain, why are you surprised that you get exactly that? Because ping(d) is supposed to keep ressources from starting on nodes which are not properly connected to the network. Thus: Still split brain, but no possibility for concurrent (and possibly damaging) access to resources. According to your configuration, it can be up to 60s before we'll detect a change in external connectivity. Thats plenty of time for the cluster to start resources. Maybe shortening the monitor interval will help you. Couldn't hurt. You need to reduce the probability to run into complete communication loss, by - using multiple communication links. There will be *one* dedicated (mpls) line between the two sites. No possibility for any real redundant links; honestly, believe me. The only way would be the usage of GSM modems or other wireless links, which is not possible for several other reasons (which I can't discuss here). - using a real quorum (there is no quorum in a two node failover cluster) Yes, and there is no proper way to use DRBD in a three node cluster. How is one related to the other? No-one said the third node had to run anything. Until then (unless we have a dedicated, replicated, shared storage, which we don't have, unfortunately), it's a two node cluster or nothing. This - inevitably - leads to the need for an external quorum, and ping(d) seems to do that, as far as I understood the docs. Please correct me if I'm wrong. You may want to still guard against the ugly effects of cluster split brain, by - implementing stonith - configuring stonith properly There's no proper way for doing stonith in a split-site scenario, besides meatware. If the link is down between the two sites, you won't be able to access any ILO, UPS or other stonith device. - additionally configuring fencing in DRBD Yes, I'm going to try that. Still: Please tell me if ping(d) is behaving properly or if it isn't. You've seen my configuration. I think it should work (and, indeed, it did a while ago; it could well be that we misconfigured something after that, but I just can't find what it is... THANKS, Andreas -- : Lars Ellenberg : LINBIT | Your Way to High Availability : DRBD/HA support and consulting http://www.linbit.com DRBD® and LINBIT® are registered trademarks of LINBIT, Austria. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems CONET Solutions GmbH, Theodor-Heuss-Allee 19, 53773 Hennef. Registergericht/Registration Court: Amtsgericht Siegburg (HRB Nr. 9136) Geschäftsführer/Managing Directors: Jürgen Zender (Sprecher/Chairman), Anke Höfer ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
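[Editorial note] For reference, the usual pattern for letting ping act as a tie-breaker is a ping clone plus a location rule on the pingd attribute; a sketch with placeholder names and addresses (not Andreas's actual configuration):

    primitive p_ping ocf:pacemaker:ping \
        params host_list="10.0.0.1 10.0.0.254" multiplier=1000 dampen=5s \
        op monitor interval=15s timeout=60s
    clone cl_ping p_ping
    location svc_needs_connectivity g_services \
        rule -inf: not_defined pingd or pingd lte 0

With multiplier=1000 and two ping targets, pingd is 2000 when both respond, 1000 when one does and 0 when neither does, so the rule above only bars a node that has lost all of its ping targets.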
Re: [Linux-ha-dev] Bug in crm shell or pengine
On Mon, Apr 18, 2011 at 11:38 PM, Serge Dubrouski serge...@gmail.com wrote: Ok, I've read the documentation. It's not a bug, it's a feature :-) Might be nice if the shell could somehow prevent such configs, but it would be non-trivial to implement. On Mon, Apr 18, 2011 at 3:01 PM, Serge Dubrouski serge...@gmail.com wrote: Hello - Looks like there is a bug in crm shell Pacemaker version 1.1.5 or in pengine. primitive pg_drbd ocf:linbit:drbd \ params drbd_resource=drbd0 \ op monitor interval=60s role=Master timeout=10s \ op monitor interval=60s role=Slave timeout=10s Log file: Apr 17 04:05:29 cs51 pengine: [5534]: ERROR: is_op_dup: Operation pg_drbd-monitor-60s-0 is a duplicate of pg_drbd-monitor-60s Apr 17 04:05:29 cs51 crmd: [5535]: info: do_state_transition: Starting PEngine Recheck Timer Apr 17 04:05:29 cs51 pengine: [5534]: ERROR: is_op_dup: Do not use the same (name, interval) combination more than once per resource Apr 17 04:05:29 cs51 pengine: [5534]: ERROR: is_op_dup: Operation pg_drbd-monitor-60s-0 is a duplicate of pg_drbd-monitor-60s Apr 17 04:05:29 cs51 pengine: [5534]: ERROR: is_op_dup: Do not use the same (name, interval) combination more than once per resource Apr 17 04:05:29 cs51 pengine: [5534]: ERROR: is_op_dup: Operation pg_drbd-monitor-60s-0 is a duplicate of pg_drbd-monitor-60s Plus strange behavior of the cluster like inability to mover resources from one node to another. -- Serge Dubrouski. -- Serge Dubrouski. ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/ ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
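[Editorial note] As the quoted error says, operations are keyed on the (name, interval) pair, so two monitor ops that differ only in role collide. The usual workaround is to give the Master and Slave monitors slightly different intervals; a sketch based on the primitive quoted above (timeouts are placeholders):

    primitive pg_drbd ocf:linbit:drbd \
        params drbd_resource=drbd0 \
        op monitor interval=59s role=Master timeout=30s \
        op monitor interval=60s role=Slave timeout=30s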
Re: [Linux-HA] Problem using Stonith external/ipmi device
On Tue, Apr 19, 2011 at 12:43 PM, Dejan Muhamedagic deja...@fastmail.fm wrote: Hi, On Mon, Apr 11, 2011 at 09:41:12AM +0200, Andrew Beekhof wrote: On Fri, Apr 8, 2011 at 11:07 AM, Matthew Richardson m.richard...@ed.ac.uk wrote: On 07/04/11 16:36, Dejan Muhamedagic wrote: For whatever reason stonith-ng doesn't think that stonithipmidisk1 can manage this node. Which version of Pacemaker do you run? Perhaps this has been fixed in the meantime. I cannot recall right now if there has been such a problem, but it's possible. You can also try to turn debug on and see if there are more clues. I'm using Pacemaker 1.1.5 from the clusterlabs rpm-next repositories on el5. I've tried turning on debug, but there's no more information coming out in the logs. man stonithd has the bits you need. start with pcmk_host_check That defaults to dynamic-list which should query the resource. Right? Right. Apparently, something's not quite ok there. the list command doesn't work perhaps? BTW, I've been doing tests with external/ssh and it did work fine. also fine with fence_xvm ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] How can I add more options to current reset, on, off options?
On Sun, Apr 17, 2011 at 11:23 PM, Avestan babak_khoram...@hotmail.com wrote: Hello, I am using STONITH Device AP9225EXP with AP9617 Network Management card. I have generated my own pacth to change the apcmaster.c file to work with my setup. The stonith appears to allow only three commands (reset - on - off) to be passed on to the STONITH Device using numbers 1, 2, 3 which reflected the selection of option 1, 2, and 3 on the outlet control console as show here: [root@dizin stonith]# stonith usage: stonith [-svh] -L stonith -n -t stonith-device-type stonith [-svh] -t stonith-device-type [-p stonith-device-parameters | -F stonith-device-parameters-file] -lS stonith [-svh] -t stonith-device-type [-p stonith-device-parameters | -F stonith-device-parameters-file] -T {reset|on|off} nodename where: -L list supported stonith device types -l list hosts controlled by this stonith device -S report stonith device status -s silent -v verbose -n output the config names of stonith-device-parameters -h display detailed help message with stonith device desriptions [root@dizin stonith]# The question is how I can add more possibilities to the list as the AP9225EXP is capable of doing more and I would like to take advantage of it. You mostly can't. But you can change what action your stonith agent performs when it receives one of the allowed values. Here is what AP9225EXP offers at it own outlet control console: 1 --- Outlet Control 1:5 Outlet Name : monitor Outlet State: ON Control Mode: Graceful Shutdown 1- Immediate On 2- Delayed On 3- Immediate Off 4- Immediate Reboot 5- Graceful Reboot 6- Shutdown 7- Override 8- Cancel ?- Help, ESC- Back, ENTER- Refresh, CTRL-L- Event Log Can someone tell me which file it is that I need to manipulate in order to increase the available options to the Ap9225EXP possibilities? Thanks, Avestan -- View this message in context: http://old.nabble.com/How-can-I-add-more-options-to-current-reset%2C-on%2C-off-options--tp31419415p31419415.html Sent from the Linux-HA mailing list archive at Nabble.com. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
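[Editorial note] If patching apcmaster.c becomes unwieldy, the same idea can be expressed as an external/ stonith script, where you decide which outlet-menu option each of the three allowed commands maps to. The skeleton below is only a sketch: send_outlet_command and the hostlist/ipaddr/login/password parameters are hypothetical helpers and names, not part of any shipped agent.

    #!/bin/sh
    # send_outlet_command <menu-option> <node> is a hypothetical helper that
    # would drive the AP9225/AP9617 outlet-control menu
    # (1 = Immediate On, 3 = Immediate Off, 4 = Immediate Reboot, 5 = Graceful Reboot).
    case "$1" in
      gethosts)         echo $hostlist ;;            # nodes this device controls
      on)               send_outlet_command 1 "$2" ;;
      off)              send_outlet_command 3 "$2" ;;
      reset)            send_outlet_command 4 "$2" ;; # or 5 for a graceful reboot
      status)           send_outlet_command status ;; # is the device reachable?
      getconfignames)   echo "hostlist ipaddr login password" ;;
      getinfo-devdescr) echo "APC AP9225EXP via outlet-control menu" ;;
      *)                exit 1 ;;
    esac

The point is that stonith only ever sends reset, on, off and status; the extra device capabilities are used by choosing what those four commands do inside the agent.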
Re: [Linux-HA] Shutdown Escalation
On Sat, Apr 16, 2011 at 12:30 PM, yash er.bhara...@yahoo.in wrote: yash er.bharat09 at yahoo.in writes: Andrew Beekhof andrew at beekhof.net writes: I am facing a problem with the heartbeat stop command: it hangs and only returns after a long 20 minutes. Through Google I learned about the shutdown escalation parameter of crmd, but when I try reducing this parameter it does not read the configuration and falls back to the 20-minute default value. IIRC the name of some of the parameters changed slightly in the last 5 years. But I don't have such versions around to say for sure. Try replacing any '-' characters with '_' Thanks for the reply, I will try this parameter and let you know... It is still taking the same value, 20 min. In the CIB I am using this parameter as part of the cib file:

<crm_config>
  <cluster_property_set id="4e816a85-e6a7-4844-af58-e16f595f1885">
    <attributes>
      <nvpair id="1" name="default_resource_stickiness" value="INFINITY"/>
      <nvpair name="no_quorum_policy" id="a6eb4bbe-c1e2-4ac4-928c-a0f881a6f46c" value="ignore"/>
    </attributes>
  </cluster_property_set>
  <cluster_property_set id="cib-bootstrap-options">
    <attributes>
      <nvpair id="cib-bootstrap-options-shutdown_escalation" name="shutdown_escalation" value="5min"/>
    </attributes>
  </cluster_property_set>
</crm_config>

Is this the right way to use this parameter? Looks right. You might be experiencing a 5 year old bug. Definitely time to upgrade. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-ha-dev] Dovecot OCF Resource Agent
On Fri, Apr 15, 2011 at 12:53 PM, Raoul Bhatia [IPAX] r.bha...@ipax.at wrote: On 04/15/2011 11:10 AM, jer...@intuxicated.org wrote: Yes, it does the same thing but contains some additional features, like logging into a mailbox. first of all, i do not know how the others think about a ocf ra implemented in c. i'll suggest waiting for comments from dejan or fghass. the ipv6addr agent was written in C too the OCF standard does not dictate the language to be used - its really a matter of whether C is the best tool for this job you could then create a fork on github and make sure it integrates well with the current build environment. second, what do you think about extending this ra to be able to handle multiple email MDAs? deep probing routines would also be needed for other MDAs. i'm thinking about giving this ra a shot but would like to hear some comments on my first remark before doing so. thanks for your work! raoul -- DI (FH) Raoul Bhatia M.Sc. email. r.bha...@ipax.at Technischer Leiter IPAX - Aloy Bhatia Hava OG web. http://www.ipax.at Barawitzkagasse 10/2/2/11 email. off...@ipax.at 1190 Wien tel. +43 1 3670030 FN 277995t HG Wien fax. +43 1 3670030 15 ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/ ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-HA] Shutdown Escalation
On Fri, Apr 15, 2011 at 8:44 AM, yash er.bhara...@yahoo.in wrote: Hello list, I am facing a problem with the heartbeat stop command: it hangs and only returns after a long 20 minutes. Through Google I learned about the shutdown escalation parameter of crmd, but when I try reducing this parameter it does not read the configuration and falls back to the 20-minute default value. IIRC the name of some of the parameters changed slightly in the last 5 years. But I don't have such versions around to say for sure. Try replacing any '-' characters with '_' I have a 3-node cluster, and to reproduce the heartbeat hang issue I used a script that does: sleep 50 sec, stop heartbeat, sleep 30 sec, start heartbeat. Is this the expected behaviour for heartbeat 2.0.5 in crm mode, or is there any other option to stop this hang? Any help appreciated. Regards Yash ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] question about lsb init script
Probably not lsb compliant. http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ap-lsb.html On Wed, Apr 13, 2011 at 10:58 PM, Gerry Kernan gerry.ker...@infinityit.ie wrote: Hi I am setting up a asterisk HA solution using a redfone device for the PRI lines. To start the device i need to run /etc/init.d/fonulator.init i have added this as a resource but it won't start and give an error as below on crm status output. The config is below , hopefully someone can point out where i am going wrong. res_fonulator.init_fonulator (lsb:fonulator.init): Started ho-asterisk2-11314.interlink.local (unmanaged) FAILED out of crm configure show node $id=a7314e15-8bb1-4b2e-a732-888db0c7b7d7 ho-asterisk1-11315.interlink.local node $id=c0630b83-3c16-49c7-a55a-2c65ea0155ed ho-asterisk2-11314.interlink.local primitive res_Filesystem_1 ocf:heartbeat:Filesystem \ params device=/dev/drbd0 directory=/rep/ fstype=ext3 \ operations $id=res_Filesystem_1-operations \ op start interval=0 timeout=60 \ op stop interval=0 timeout=60 \ op monitor interval=20 timeout=40 start-delay=0 primitive res_IPaddr2_IPaddr ocf:heartbeat:IPaddr2 \ params ip=10.1.2.98 nic=eth0 cidr_netmask=24 \ operations $id=res_IPaddr2_IPaddr-operations \ op start interval=0 timeout=20 \ op stop interval=0 timeout=20 \ op monitor interval=10 timeout=20 start-delay=0 primitive res_dahdi_dahdi lsb:dahdi \ operations $id=res_dahdi_dahdi-operations \ op start interval=0 timeout=15 \ op stop interval=0 timeout=15 \ op monitor interval=15 timeout=15 start-delay=15 primitive res_drbd_1 ocf:linbit:drbd \ params drbd_resource=asterisk \ operations $id=res_drbd_1-operations \ op start interval=0 timeout=240 \ op promote interval=0 timeout=90 \ op demote interval=0 timeout=90 \ op stop interval=0 timeout=100 \ op monitor interval=10 timeout=20 start-delay=0 primitive res_fonulator.init_fonulator lsb:fonulator.init \ operations $id=res_fonulator.init_fonulator-operations \ op start interval=0 timeout=15 \ op stop interval=0 timeout=15 \ op monitor interval=15 timeout=15 start-delay=15 primitive res_httpd_httpd lsb:httpd \ operations $id=res_httpd_httpd-operations \ op start interval=0 timeout=15 \ op stop interval=0 timeout=15 \ op monitor interval=15 timeout=15 start-delay=15 primitive res_mysqld_mysql lsb:mysqld \ operations $id=res_mysqld_mysql-operations \ op start interval=0 timeout=15 \ op stop interval=0 timeout=15 \ op monitor interval=15 timeout=15 start-delay=15 ms ms_drbd_1 res_drbd_1 \ meta clone-max=2 notify=true colocation col_res_Filesystem_1_ms_drbd_1 inf: res_Filesystem_1 ms_drbd_1:Master order ord_ms_drbd_1_res_Filesystem_1 inf: ms_drbd_1:promote res_Filesystem_1:start property $id=cib-bootstrap-options \ default-resource-stickiness=100 \ stonith-enabled=false \ stonith-action=poweroff \ dc-version=1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3 \ default-resource-failure-stickiness=100 \ no-quorum-policy=ignore \ cluster-infrastructure=Heartbeat \ last-lrm-refresh=1302723679 /etc/init.d/fonutalor.init #!/bin/bash # # fonulator Starts and Stops the Redfone fonulator utility # # chkconfig: - 60 50 # description: Utility for configuring the Redfone fonebridge # # processname: fonulator # config: /etc/redfone.conf # Source function library. . /etc/rc.d/init.d/functions # Source networking configuration. . /etc/sysconfig/network # Check that networking is up. [ ${NETWORKING} = no ] exit 0 [ -x /usr/local/bin/fonulator ] || exit 0 RETVAL=0 prog=fonulator start() { # Start daemons. 
if [ -d /etc/ ] ; then for i in `ls /etc/redfone.conf`; do site=`basename $i .conf` echo -n $Starting $prog for $site: /usr/local/bin/fonulator $i RETVAL=$? [ $RETVAL -eq 0 ] { touch /var/lock/subsys/$prog success $$prog $site } echo done else RETVAL=1 fi return $RETVAL } stop() { # Stop daemons. echo -n $Shutting down $prog: killproc $prog RETVAL=$? echo [ $RETVAL -eq 0 ]
Re: [Linux-HA] question about lsb init script
On Thu, Apr 14, 2011 at 11:18 AM, Gerry Kernan gerry.ker...@infinityit.ie wrote: Andrew, Thanks, I've done some checking and it doesn't appear to be. Can I add a resource that runs a command and doesn't look for a status for the resource? No. The status operation is required to be implemented by the script. Any script that does not implement it is not an LSB init script. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
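[Editorial note] For a one-shot tool like the fonulator script quoted earlier in this thread, the missing status action only needs to report whether the lock file created by start is present. A minimal sketch of a branch to add alongside the existing start) and stop) cases (exit codes follow the LSB convention: 0 = running, 3 = not running):

    status)
            if [ -f /var/lock/subsys/$prog ]; then
                    echo "$prog is running"
                    RETVAL=0
            else
                    echo "$prog is stopped"
                    RETVAL=3
            fi
            ;;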
Re: [Linux-HA] Resource-Group won't start - crm_mon does not react - no failures shown
On Wed, Apr 13, 2011 at 12:23 AM, Stallmann, Andreas astallm...@conet.de wrote: Hi! We've got a pretty straightforward and easy configuration: Corosync 1.2.1 / Pacemaker 2.0.0 on OpenSuSE 11.3 running DRBD (M/S), Ping (clone), and a resource-group, containing a shared IP, tomcat and mysql (where the datafiles of mysql reside on the DRBD). The cluster consists of two virtual machines running on VMware ESXi 4. Since we moved the cluster to an other vmware esxi, strange things happen: While DRBD and the ping resource come up on both nodes, the resource group appl_grp (see below) doesn't. No failures are shown in crm_mon and the failcount is zero. Is host_list=191.224.111.1 191.224.111.78 194.25.2.129 still valid? If no value is being set for pingd then I can imagine this would be the result. Output of crm_mon: ~ Last updated: Tue Apr 12 23:39:39 2011 Stack: openais Current DC: cms-appl02 - partition with quorum Version: 1.1.2-8b9ec9ccc5060457ac761dce1de719af86895b10 2 Nodes configured, 2 expected votes 3 Resources configured. Online: [ cms-appl01 cms-appl02 ] Master/Slave Set: ms_drbd_r0 Masters: [ cms-appl01 ] Slaves: [ cms-appl02 ] Clone Set: pingy_clone Started: [ cms-appl01 cms-appl02 ] ~~ Normally, I'd at least saw the resource group as stoped, but now it doesn't even turn up in the crm_mon display! The crm-Tool at least shows, that the resources still exist: ~~ crm(live)# resource crm(live)resource# show Resource Group: appl_grp fs_r0 (ocf::heartbeat:Filesystem) Stopped sharedIP (ocf::heartbeat:IPaddr2) Stopped tomcat_res (ocf::heartbeat:tomcat) Stopped database_res (ocf::heartbeat:mysql) Stopped Master/Slave Set: ms_drbd_r0 Masters: [ cms-appl01 ] Slaves: [ cms-appl02 ] Clone Set: pingy_clone Started: [ cms-appl01 cms-appl02 ] ~~~ And finally, here's our configuration: ~~output of crm configure show node cms-appl01 node cms-appl02 primitive database_res ocf:heartbeat:mysql \ params binary=/usr/bin/mysqld_safe config=/etc/my.cnf datadir=/drbd/mysql user=mysql log=/var/log/mysql/mysqld.logpid=/var/run/mysql/mysqld.pid socket=/drbd/run/mysql/mysql.sock \ op start interval=0 timeout=120s \ op stop interval=0 timeout=120s \ op monitor interval=10s timeout=30s \ op notify interval=0 timeout=90s primitive drbd_r0 ocf:linbit:drbd \ params drbd_resource=r0 \ op monitor interval=15s \ op start interval=0 timeout=240s \ op stop interval=0 timeout=100s primitive fs_r0 ocf:heartbeat:Filesystem \ params device=/dev/drbd0 directory=/drbd fstype=ext4 \ op start interval=0 timeout=60s \ op stop interval=0 timeout=60s primitive pingy_res ocf:pacemaker:ping \ params dampen=5s multiplier=1000 host_list=191.224.111.1 191.224.111.78 194.25.2.129 \ op monitor interval=60s timeout=60s \ op start interval=0 timeout=60s primitive sharedIP ocf:heartbeat:IPaddr2 \ params ip=191.224.111.50 cidr_netmask=255.255.255.0 nic=eth0:0 primitive tomcat_res ocf:heartbeat:tomcat \ params java_home=/etc/alternatives/jre \ params catalina_home=/usr/share/tomcat6 \ op start interval=0 timeout=60s \ op stop interval=0 timeout=120s \ op monitor interval=10s timeout=30s group appl_grp fs_r0 sharedIP tomcat_res database_res \ meta target-role=Started ms ms_drbd_r0 drbd_r0 \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true clone pingy_clone pingy_res location appl_loc appl_grp 100: cms-appl01 location only-if-connected appl_grp \ rule $id=only-if-connected-rule -inf: not_defined pingd or pingd lte 2000 colocation appl_grp-only-on-master inf: appl_grp ms_drbd_r0:Master order appl_grp-after-drbd inf: 
ms_drbd_r0:promote appl_grp:start order mysql-after-fs inf: fs_r0 database_res property $id=cib-bootstrap-options \ stonith-enabled=false \ no-quorum-policy=ignore \ stonith-action=poweroff \ default-resource-stickiness=100 \ dc-version=1.1.2-8b9ec9ccc5060457ac761dce1de719af86895b10 \ cluster-infrastructure=openais \ expected-quorum-votes=2 \ last-lrm-refresh=1302643565 ~ When I (re)activate the appl_grp, literarily nothing happens: crm(live)resource# start nag_grp No new entries in /var/log/messages, no visible changes in crm_mon. It is as if the resource didn't exist. Any ideas? You'll find the logs below. Cheers and good night, Andreas I found only one error message in /var/log/messages:
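[Editorial note] A quick way to confirm Andrew's suspicion is to check whether the pingd attribute is being set at all on either node; for example (the node name is taken from the configuration above):

    crm_mon -A -1
    crm_attribute -N cms-appl01 -n pingd -l reboot -G

If pingd never appears, the not_defined clause in the only-if-connected rule evaluates to -inf on both nodes, the group has nowhere it is allowed to run, and nothing is logged when you ask it to start - which would match the silent behaviour described above.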
Re: [Linux-HA] Filesystem thinks it is run as a clone
On Tue, Apr 12, 2011 at 5:17 PM, Christoph Bartoschek bartosc...@or.uni-bonn.de wrote: Hi, today we tested some NFS cluster scenarios and the first test failed. The first test was to put the current master node into standby. Stopping the services worked but then starting it on the other node failed. The ocf:hearbeat:Filesystem resource failed to start. In the logfile we see: Apr 12 14:08:42 laplace Filesystem[10772]: [10820]: INFO: Running start for /dev/home-data/afs on /srv/nfs/afs Apr 12 14:08:42 laplace Filesystem[10772]: [10822]: ERROR: DANGER! ext4 on /dev/home-data/afs is NOT cluster-aware! Apr 12 14:08:42 laplace Filesystem[10772]: [10824]: ERROR: DO NOT RUN IT AS A CLONE! To my eye the Filesystem agent looks confused The message comes from the following code in ocf:hearbeat:Filesystem: case $FSTYPE in ocfs2) ocfs2_init ;; nfs|smbfs|none|gfs2) : # this is kind of safe too ;; *) if [ -n $OCF_RESKEY_CRM_meta_clone ]; then ocf_log err DANGER! $FSTYPE on $DEVICE is NOT cluster-aware! ocf_log err DO NOT RUN IT AS A CLONE! ocf_log err Politely refusing to proceed to avoid data corruption. exit $OCF_ERR_CONFIGURED fi ;; esac The message is only printed if the variable OCF_RESKEY_CRM_meta_clone is non-zero. Our configuration however does not run the filesystem as a clone. Somehow the OCF_RESKEY_CRM_meta_clone variable leaked into the start of the Filesystem resource. Is this a known bug? Or is there a configuration error on our side? Here is the current configuration: node laplace \ attributes standby=off node ries \ attributes standby=off primitive ClusterIP ocf:heartbeat:IPaddr2 \ params ip=192.168.143.228 cidr_netmask=24 \ op monitor interval=30s \ meta target-role=Started primitive p_drbd_nfs ocf:linbit:drbd \ params drbd_resource=home-data \ op monitor interval=15 role=Master \ op monitor interval=30 role=Slave primitive p_exportfs_afs ocf:heartbeat:exportfs \ params fsid=1 directory=/srv/nfs/afs options=rw,no_root_squash,mountpoint \ clientspec=192.168.143.0/255.255.255.0 \ wait_for_leasetime_on_stop=false \ op monitor interval=30s primitive p_fs_afs ocf:heartbeat:Filesystem \ params device=/dev/home-data/afs directory=/srv/nfs/afs \ fstype=ext4 \ op monitor interval=10s \ meta target-role=Started primitive p_lsb_nfsserver lsb:nfs-kernel-server \ op monitor interval=30s primitive p_lvm_nfs ocf:heartbeat:LVM \ params volgrpname=home-data \ op monitor interval=30s group g_nfs p_lvm_nfs p_fs_afs p_exportfs_afs ClusterIP ms ms_drbd_nfs p_drbd_nfs \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started clone cl_lsb_nfsserver p_lsb_nfsserver colocation c_nfs_on_drbd inf: g_nfs ms_drbd_nfs:Master order o_drbd_before_nfs inf: ms_drbd_nfs:promote g_nfs:start property $id=cib-bootstrap-options \ dc-version=1.0.9-unknown \ cluster-infrastructure=openais \ expected-quorum-votes=2 \ stonith-enabled=false \ no-quorum-policy=ignore \ last-lrm-refresh=1302610197 rsc_defaults $id=rsc-options \ resource-stickiness=200 Thanks Christoph ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Filesystem thinks it is run as a clone
On Wed, Apr 13, 2011 at 10:57 AM, Christoph Bartoschek bartosc...@or.uni-bonn.de wrote: Am 13.04.2011 08:26, schrieb Andrew Beekhof: On Tue, Apr 12, 2011 at 5:17 PM, Christoph Bartoschek bartosc...@or.uni-bonn.de wrote: Hi, today we tested some NFS cluster scenarios and the first test failed. The first test was to put the current master node into standby. Stopping the services worked but then starting it on the other node failed. The ocf:hearbeat:Filesystem resource failed to start. In the logfile we see: Apr 12 14:08:42 laplace Filesystem[10772]: [10820]: INFO: Running start for /dev/home-data/afs on /srv/nfs/afs Apr 12 14:08:42 laplace Filesystem[10772]: [10822]: ERROR: DANGER! ext4 on /dev/home-data/afs is NOT cluster-aware! Apr 12 14:08:42 laplace Filesystem[10772]: [10824]: ERROR: DO NOT RUN IT AS A CLONE! To my eye the Filesystem agent looks confused The agent is confused because OCF_RESKEY_CRM_meta_clone is non-zero. Is this something that can happen? Not unless the resource has been cloned - and looking at the config this did not seem to be the case. Or did I miss something? The message comes from the following code in ocf:hearbeat:Filesystem: case $FSTYPE in ocfs2) ocfs2_init ;; nfs|smbfs|none|gfs2) : # this is kind of safe too ;; *) if [ -n $OCF_RESKEY_CRM_meta_clone ]; then ocf_log err DANGER! $FSTYPE on $DEVICE is NOT cluster-aware! ocf_log err DO NOT RUN IT AS A CLONE! ocf_log err Politely refusing to proceed to avoid data corruption. exit $OCF_ERR_CONFIGURED fi ;; esac The message is only printed if the variable OCF_RESKEY_CRM_meta_clone is non-zero. Our configuration however does not run the filesystem as a clone. Somehow the OCF_RESKEY_CRM_meta_clone variable leaked into the start of the Filesystem resource. Is this a known bug? Or is there a configuration error on our side? 
Here is the current configuration: node laplace \ attributes standby=off node ries \ attributes standby=off primitive ClusterIP ocf:heartbeat:IPaddr2 \ params ip=192.168.143.228 cidr_netmask=24 \ op monitor interval=30s \ meta target-role=Started primitive p_drbd_nfs ocf:linbit:drbd \ params drbd_resource=home-data \ op monitor interval=15 role=Master \ op monitor interval=30 role=Slave primitive p_exportfs_afs ocf:heartbeat:exportfs \ params fsid=1 directory=/srv/nfs/afs options=rw,no_root_squash,mountpoint \ clientspec=192.168.143.0/255.255.255.0 \ wait_for_leasetime_on_stop=false \ op monitor interval=30s primitive p_fs_afs ocf:heartbeat:Filesystem \ params device=/dev/home-data/afs directory=/srv/nfs/afs \ fstype=ext4 \ op monitor interval=10s \ meta target-role=Started primitive p_lsb_nfsserver lsb:nfs-kernel-server \ op monitor interval=30s primitive p_lvm_nfs ocf:heartbeat:LVM \ params volgrpname=home-data \ op monitor interval=30s group g_nfs p_lvm_nfs p_fs_afs p_exportfs_afs ClusterIP ms ms_drbd_nfs p_drbd_nfs \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started clone cl_lsb_nfsserver p_lsb_nfsserver colocation c_nfs_on_drbd inf: g_nfs ms_drbd_nfs:Master order o_drbd_before_nfs inf: ms_drbd_nfs:promote g_nfs:start property $id=cib-bootstrap-options \ dc-version=1.0.9-unknown \ cluster-infrastructure=openais \ expected-quorum-votes=2 \ stonith-enabled=false \ no-quorum-policy=ignore \ last-lrm-refresh=1302610197 rsc_defaults $id=rsc-options \ resource-stickiness=200 Thanks Christoph ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-ha-dev] Resource agent implementing SPC-3 Persistent Reservations (contribution from Evgeny Nifontov)
Awesome. I was wondering if someone would ever write one of these :) On Tue, Apr 12, 2011 at 10:29 AM, Florian Haas florian.h...@linbit.com wrote: Hi everyone, Evgeny Nifontov has started to implement sg_persist, a resource agent managing SPC-3 Persistent Reservations (PRs) using the sg_persist binary. He's put up a personal repo on Github and the initial commit is here: https://github.com/nif/ClusterLabs__resource-agents/commit/d0c46fb35338d28de3e2c20c11d0ad01dded13fd I've added some comments for an initial review. Everyone interested please pitch in. Thanks to Evgeny for an the contribution! Cheers, Florian ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/ ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-HA] Problem using Stonith external/ipmi device
On Fri, Apr 8, 2011 at 11:07 AM, Matthew Richardson m.richard...@ed.ac.uk wrote: On 07/04/11 16:36, Dejan Muhamedagic wrote: For whatever reason stonith-ng doesn't think that stonithipmidisk1 can manage this node. Which version of Pacemaker do you run? Perhaps this has been fixed in the meantime. I cannot recall right now if there has been such a problem, but it's possible. You can also try to turn debug on and see if there are more clues. I'm using Pacemaker 1.1.5 from the clusterlabs rpm-next repositories on el5. I've tried turning on debug, but there's no more information coming out in the logs. man stonithd has the bits you need. start with pcmk_host_check Thanks, Matthew -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] HA software download
On Wed, Apr 6, 2011 at 1:28 PM, Ajaykumar Narayanaswamy ajaykumar_narayanasw...@mindtree.com wrote: Hi All, I would like to know whether Linux OS has any inbuilt HA/Failover software or should we procure some third-party HA s/w. I came to know about heartbeat package which is an Open source application and also have downloaded the same, but does this help in providing failover for LDAP Server running on Linux OS for about 2000 SAP Users who would be using it for authentication. Heartbeat and/or Pacemaker (the bit that used to be the crm in heartbeat v2) are shipped by most major distributions and can handle this kind of task. Looking forward to hearing from you. Regards, Ajay http://www.mindtree.com/email/disclaimer.html ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] When is the next release for resource agents?
On Wed, Apr 6, 2011 at 4:55 PM, Serge Dubrouski serge...@gmail.com wrote: Hello - When is the next release for resource agents? Agents that come with resource-agents-1.0.3-2.6.el5 from the clusterlabs repository are very outdated. pgsql is at least one year old or so. In most cases there's not really a need for clusterlabs to ship the entire stack anymore (plus it's a heap of work for me). Instead, the pacemaker packages simply build against whatever the distro provides. el5 is the exception; I'll try to update it there in the coming days. -- Serge Dubrouski. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] HA software download
On Thu, Apr 7, 2011 at 12:12 PM, Ajaykumar Narayanaswamy ajaykumar_narayanasw...@mindtree.com wrote: Hi Andrew, I have a small query, is it okay to have two LDAP servers running on Linux OS configured in Active/Active mode, else they should be configured in Active/Active mode, because as per my knowledge.. LDAP HA should be configured in Active/passive, one server should be enabled in R/W mode and other should be in R/O mode. so that we can sync the R/O or else syncing will be a problem. and in case the active goes down we can make the passive live by changing the IP address. Could you please throw some light on this query??? No. I've never run an ldap server. Sorry. Thx for lending help.. Regards, Ajaykumar -Original Message- From: Ajaykumar Narayanaswamy Sent: Thursday, April 07, 2011 12:41 PM To: 'Andrew Beekhof' Subject: RE: [Linux-HA] HA software download Thx a lot Andrew.. Indeed a great help, thx a lot once again. Regards, Ajaykumar -Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Thursday, April 07, 2011 12:30 PM To: General Linux-HA mailing list Cc: Ajaykumar Narayanaswamy Subject: Re: [Linux-HA] HA software download On Wed, Apr 6, 2011 at 1:28 PM, Ajaykumar Narayanaswamy ajaykumar_narayanasw...@mindtree.com wrote: Hi All, I would like to know whether Linux OS has any inbuilt HA/Failover software or should we procure some third-party HA s/w. I came to know about heartbeat package which is an Open source application and also have downloaded the same, but does this help in providing failover for LDAP Server running on Linux OS for about 2000 SAP Users who would be using it for authentication. Heartbeat and/or Pacemaker (the bit that used to be the crm in heartbeat v2) are shipped by most major distributions and can handle this kind of task. Looking forward to hearing from you. Regards, Ajay http://www.mindtree.com/email/disclaimer.html ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] heartbeat ordering
On Tue, Apr 5, 2011 at 11:58 AM, Maxim Ianoglo dot...@gmail.com wrote: Hello, I have four servers in an HA cluster: NodeA NodeB NodeC NodeD. Three groups of resources and one inline resource are defined: 1. group_storage ( NFS VIP, NFS Server, DRBD ) 2. group_apache_www (Domains VIPs and Apache) 3. group_nginx_www (Static files with nginx) 4. inline_nfs_client ( NFS client ) (1) should run only on NodeC or NodeD; NodeC is preferable, NodeD for backup. (2) should run on NodeC and NodeD; NodeD is preferable, NodeC for backup. (3) should run on NodeC and NodeD; NodeC is preferable, NodeD for backup. (4) should run on every node except the node on which (1) is located. I have the following order constraints: (2) depends on (1) and (4), (3) depends on (1) and (4), (4) depends on (1). Colocations: (4) and (1) should not run on the same node. The issue is that resource (4) chooses NodeC, which is the default node for (1), so (1) has to choose another node and goes to NodeD. How can I make resource (1) choose its node earlier than (4) and any other resource? Swap the order the resources are listed in the colocation constraint. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
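[Editorial note] In crm shell terms the swap looks roughly like this, assuming the existing constraint reads like the first line (the constraint name is a placeholder):

    # before: inline_nfs_client is placed first, group_storage then has to avoid it
    colocation storage-apart -inf: group_storage inline_nfs_client
    # after: group_storage is placed first, inline_nfs_client then avoids its node
    colocation storage-apart -inf: inline_nfs_client group_storage

In a two-resource colocation the second (with) resource is allocated first, so listing group_storage second lets it claim NodeC before the NFS client is placed.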
Re: [Linux-HA] why Cluster restarts A, before starting B on surviving node.
I meant in the form of an hb_report, which contains the logs and status information necessary to diagnose your issue. On Mon, Apr 4, 2011 at 12:11 PM, Muhammad Sharfuddin m.sharfud...@nds.com.pk wrote: On Mon, 2011-04-04 at 10:42 +0200, Andrew Beekhof wrote: On Thu, Mar 24, 2011 at 7:42 PM, Muhammad Sharfuddin m.sharfud...@nds.com.pk wrote: We have two resources, A and B. The cluster starts A on node1 and B on node2; the failover node for A is node2 and the failover node for B is node1. B can't start without A, so I have the following order constraint: order first_A_then_B : A B Problem/Question: now if B fails due to node failure, the cluster restarts A before starting B on the surviving node (node1). My question/problem is why the cluster restarts A. My question/problem is that you've given us no information on which to base a reply. SLES 11 SP1 updated, SLE HAE SP1 + updates. node1 hostname: this is a 'distributed' and/or 'Active/Active' two-node cluster. Scenario: the cluster starts resource A on node1 and resource B on node2, due to the following location constraints: location PrimaryLoc-of-A A +inf: node1 location PrimaryLoc-of-B B +inf: node2 B is a resource that depends on resource A, therefore I have an order constraint: order first_A_then_B : A B Now node2 goes down, so the cluster starts moving resource B (i.e. resource B fails over) to node1, where resource A is already running; but during this process the cluster first stops and starts (restarts) resource A, and then starts B. Problem/Question: why does the cluster restart resource 'A' during the failover of resource B? ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
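[Editorial note] For reference, an hb_report covering the window around the failover can be generated with something like the following (timestamps and destination are placeholders):

    hb_report -f "2011-03-24 19:00" -t "2011-03-24 20:00" /tmp/failover-report

Run it on one node; it collects the logs, the CIB and the PE inputs from all cluster nodes into a single archive that can be attached to a reply or a bug report.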
Re: [Linux-HA] crm commands : how to reduce the delay between two commands
On Fri, Mar 25, 2011 at 2:07 PM, Alain.Moulle alain.mou...@bull.net wrote: Hi, I tried but it does not work : crm_resource -r resname -p target-role -v started because it adds a target-role=started as params whereis I already have a meta target-role=Stopped so resource does not start. So I tried : crm_resource -r resname -m -p target-role -v started then resource starts successfully. But with a loop: for i in {1..20}; do echo resname$i ; crm_resource -r resname$i -m -p target-role -v started; done The first one is started immediately, and the 19th other ones are started ~20s after the first one but all in one salvo. So it seems to be quite the same behavior as successive crm resource start resname$i commands. First command is taken in account immediately, then there is a delay perhaps before pooling eventuals other crm commands, but as during this delay , my loop has already sent 19 commands, these are taken in account in one shot when the new polling occurs. Meaning, that manually, if you wait that the expected result of your crm command is displayed on crm_mon, before sending the second one etc. there is always this 10 to 20s latency between each commands. (Same behavior inside scripts if the script waits for the command to be really completed by testing ...) Hope my description is clear enough ... Yes. Looks like something in core pacemaker. Could you file a bug for this and include the output of your above testcase but with - added to the crm_resource command line please? ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] DRBD and pacemaker interaction
On Mon, Apr 4, 2011 at 10:14 PM, Lars Ellenberg lars.ellenb...@linbit.com wrote: On Mon, Apr 04, 2011 at 09:43:27AM +0200, Andrew Beekhof wrote: I am missing the state: running degraded or suboptimal. Yep, degraded is not a state available for pacemaker. Pacemaker cannot do much about suboptimal. Maybe we need to add OCF_RUNNING_BUT_DEGRADED to the OCF spec (and the PE). And, of course, OCF_MASTER_BUT_ONLY_ONE_FAILURE_AWAY_FROM_COMPLETE_DATA_LOSS Feeling quite alright there? If it makes people happy to see Master/Slave Set: ms_drbd_data (DEGRADED) p_drbd_data:0 (ocf::linbit:drbd): Master bk1 (DEGRADED) p_drbd_data:1 (ocf::linbit:drbd): Slave bk2 (DEGRADED) in crm_mon, then sure, go for it. Other than that, I don't think that pacemaker can do much about degraded resources. The intention was that the PE would treat it the same as OCF_RUNNING - hence the name. It would exist purely to give admin tools the ability to provide additional feedback to users - like you outlined above. Essentially it would be a way for the RA to say Something isn't right, but you (ie. pacemaker) shouldn't do anything about it other than let a human know. Anything more complex is WAY out of scope. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] DRBD and pacemaker interaction
On Tue, Apr 5, 2011 at 9:42 AM, Christoph Bartoschek bartosc...@or.uni-bonn.de wrote: Am 04.04.2011 22:14, schrieb Lars Ellenberg: On Mon, Apr 04, 2011 at 09:43:27AM +0200, Andrew Beekhof wrote: I am missing the state: running degraded or suboptimal. Yep, degraded is not a state available for pacemaker. Pacemaker cannot do much about suboptimal. Maybe we need to add OCF_RUNNING_BUT_DEGRADED to the OCF spec (and the PE). And, of course, OCF_MASTER_BUT_ONLY_ONE_FAILURE_AWAY_FROM_COMPLETE_DATA_LOSS What about using the standard output of the monitor operation as a status string that is displayed by crm_mon if available? I can imagine that such a change is less intrusive. Far from it, we'd need to start storing the stdout result in the CIB. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] DRBD and pacemaker interaction
On Fri, Apr 1, 2011 at 8:13 PM, Christoph Bartoschek bartosc...@or.uni-bonn.de wrote: Am 01.04.2011 16:38, schrieb Lars Ellenberg: On Fri, Apr 01, 2011 at 11:35:19AM +0200, Christoph Bartoschek wrote: Am 01.04.2011 11:27, schrieb Florian Haas: On 2011-04-01 10:49, Christoph Bartoschek wrote: Am 01.04.2011 10:27, schrieb Andrew Beekhof: On Sat, Mar 26, 2011 at 12:10 AM, Lars Ellenberg lars.ellenb...@linbit.com wrote: On Fri, Mar 25, 2011 at 06:18:07PM +0100, Christoph Bartoschek wrote: I am missing the state: running degraded or suboptimal. Yep, degraded is not a state available for pacemaker. Pacemaker cannot do much about suboptimal. I wonder what it would take to change that. I suspect either a crystal ball or way too much knowledge of drbd internals. The RA would be responsible to check this. For drbd any diskstate different from UpToDate/UpToDate is suboptimal. Have you actually looked at the resource agent? It does already evaluate the disk state and adjusts the master preference accordingly. What else is there to do? Maybe I misunderstood Andrew's comment. I read it this way: If we introduce a new state suboptimal, would it be hard to detect it? I just wanted to express that detecting suboptimality seems not to be that hard. But that state is useless for pacemaker, since it cannot do anything about it. I thought I made that clear. You made clear that pacemaker cannot do anything about it. However crm_mon could report it. One may think that is can be neglected. But the current output of crm_mon is unexpected for me. Maybe we need to add OCF_RUNNING_BUT_DEGRADED to the OCF spec (and the PE). ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] NFS cluster after node crash
On Thu, Mar 24, 2011 at 9:58 PM, Christoph Bartoschek bartosc...@or.uni-bonn.de wrote: It seems as if the g_nfs service is stopped on the surviving node when the other one comes up again. To me it looks like the service gets stopped after it fails: p_exportfs_root:0_monitor_3 (node=laplace, call=12, rc=7, status=complete): not running Does anyone see a reason why the service does not continue to run? Christoph Am 22.03.2011 22:37, schrieb Christoph Bartoschek: Hi, I've created a NFS cluster after the linbit tutorial Highly available NFS storage with DRBD and Pacemaker. Generally it seems to work fine. Today I simlated a node crash by just turning a maschine off. Failover went fine. After 17 seconds the second node was able to serve the clients. But when I started the crashed node again the service went down. I wonder why the cluster did not just restart the services on the new node? Instead it tried to change status on the surviving node. What is going wrong? The resulting status is: Online: [ ries laplace ] Master/Slave Set: ms_drbd_nfs [p_drbd_nfs] Masters: [ ries ] Slaves: [ laplace ] Clone Set: cl_lsb_nfsserver [p_lsb_nfsserver] Started: [ ries laplace ] Resource Group: g_nfs p_lvm_nfs (ocf::heartbeat:LVM): Started ries p_fs_afs (ocf::heartbeat:Filesystem): Started ries (unmanaged) FAILED p_ip_nfs (ocf::heartbeat:IPaddr2): Stopped Clone Set: cl_exportfs_root [p_exportfs_root] p_exportfs_root:0 (ocf::heartbeat:exportfs): Started laplace FAILED Started: [ ries ] Failed actions: p_exportfs_root:0_monitor_3 (node=laplace, call=12, rc=7, status=complete): not running p_fs_afs_stop_0 (node=ries, call=37, rc=-2, status=Timed Out): unknown exec error My configuration is: node laplace \ attributes standby=off node ries \ attributes standby=off primitive p_drbd_nfs ocf:linbit:drbd \ params drbd_resource=afs \ op monitor interval=15 role=Master \ op monitor interval=30 role=Slave primitive p_exportfs_root ocf:heartbeat:exportfs \ params fsid=0 directory=/srv/nfs options=rw,no_root_squash,crossmnt clientspec=192.168.1.0/255.255.255.0 wait_for_leasetime_on_stop=1 \ op monitor interval=30s \ op stop interval=0 timeout=100s primitive p_fs_afs ocf:heartbeat:Filesystem \ params device=/dev/afs/afs directory=/srv/nfs/afs fstype=ext4 \ op monitor interval=10s primitive p_ip_nfs ocf:heartbeat:IPaddr2 \ params ip=192.168.1.100 cidr_netmask=24 \ op monitor interval=30s \ meta target-role=Started primitive p_lsb_nfsserver lsb:nfsserver \ op monitor interval=30s primitive p_lvm_nfs ocf:heartbeat:LVM \ params volgrpname=afs \ op monitor interval=30s group g_nfs p_lvm_nfs p_fs_afs p_ip_nfs \ meta target-role=Started ms ms_drbd_nfs p_drbd_nfs \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started clone cl_exportfs_root p_exportfs_root \ meta target-role=Started clone cl_lsb_nfsserver p_lsb_nfsserver \ meta target-role=Started colocation c_nfs_on_drbd inf: g_nfs ms_drbd_nfs:Master colocation c_nfs_on_root inf: g_nfs cl_exportfs_root order o_drbd_before_nfs inf: ms_drbd_nfs:promote g_nfs:start order o_nfs_server_before_exportfs inf: cl_lsb_nfsserver cl_exportfs_root:start order o_root_before_nfs inf: cl_exportfs_root g_nfs:start property $id=cib-bootstrap-options \ dc-version=1.1.5-ecb6baaf7fc091b023d6d4ba7e0fce26d32cf5c8 \ cluster-infrastructure=openais \ expected-quorum-votes=2 \ stonith-enabled=false \ no-quorum-policy=ignore \ last-lrm-refresh=1300828539 rsc_defaults $id=rsc-options \ resource-stickiness=200 Christoph ___ Linux-HA mailing list 
Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
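Once whatever blocked the p_fs_afs stop (the Timed Out action above) has been dealt with, the usual way to let the cluster re-evaluate is to clear the failed actions. A minimal sketch using the resource names from this thread (the node argument is optional):

    crm resource cleanup p_fs_afs
    crm resource cleanup p_exportfs_root laplace
    crm_mon -1     # confirm the failed actions are gone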
Re: [Linux-HA] Stonith resource appears to be active on 2 nodes ...
On Mon, Apr 4, 2011 at 9:03 AM, Alain.Moulle alain.mou...@bull.net wrote: Hi, I got this error : 1301591983 2011 Mar 31 19:19:43 berlin5 daemon err crm_resource [36968]: ERROR: native_add_running: Resource stonith::fence_ipmilan:restofenceberlin4 appears to be active on 2 nodes. 1301591983 2011 Mar 31 19:19:43 berlin5 daemon warning crm_resource [36968]: WARN: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information I check on this URL, and there are two listed potential causes : 1. the resource is started at boot time : this is for sure not the case. 2. the monitor op in fence_ipmilan could be implemented not correctly ? Is a stonith resource to be mandatorilly an OCF script ? This should be a higher level issue. Possibly in stonith-ng. Logs? I check the fence_ipmilan source : else if (!strcasecmp(op, status) || !strcasecmp(op, monitor)) { printf(Getting status of IPMI:%s...,ip); fflush(stdout); ret = ipmi_op(i, ST_STATUS, power_status); switch(ret) { case STATE_ON: if (!strcasecmp(op, status)) printf(Chassis power = On\n); translated_ret = ERR_STATUS_ON; ret = 0; break; case STATE_OFF: if (!strcasecmp(op, status)) printf(Chassis power = Off\n); translated_ret = ERR_STATUS_OFF; ret = 0; break; default: if (!strcasecmp(op, status)) printf(Chassis power = Unknown\n); translated_ret = ERR_STATUS_FAIL; ret = 1; break; } Any idea about where could be potentially the problem ? (knowing that I think that fence_ipmilan is NOT an OCF script, but a stonith script delivered by RH as fence_agents) Thanks a lot Alain ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
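For reference, a fence_ipmilan device is normally configured as a stonith-class primitive. This is only a sketch with placeholder credentials; the parameter names are the usual fence_ipmilan ones, but check the agent's metadata on your system before relying on them:

    primitive st-berlin4 stonith:fence_ipmilan \
          params ipaddr="10.0.0.44" login="admin" passwd="secret" \
                 pcmk_host_list="berlin4" \
          op monitor interval="1800s"
    location st-berlin4-not-on-berlin4 st-berlin4 -inf: berlin4

The -inf location keeps the device off the node it is meant to fence, which is common practice for IPMI devices since a node cannot reliably fence itself.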
Re: [Linux-HA] why Cluster restarts A, before starting B on surviving node.
On Thu, Mar 24, 2011 at 7:42 PM, Muhammad Sharfuddin m.sharfud...@nds.com.pk wrote: we have two resources A and B. Cluster starts A on node1, and B on node2, while the failover node for A is node2 and the failover node for B is node1. B can't start without A, so I have the following order rule: order first_A_then_B : A B Problem/Question: now if B fails due to node failure, Cluster restarts A before starting B on the surviving node (node1). my question/problem, is why Cluster restarts A. my question/problem, is that you've given us no information on which to base a reply. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
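Without logs it is hard to say more, but if the intention is only to sequence A before B when both happen to be starting, and not to tie A's recovery to B's, an advisory ordering (score 0 instead of the default mandatory inf) is the usual tool. A sketch based on the constraint quoted above:

    order first_A_then_B 0: A B    # advisory: ordering is honoured when both start in the
                                   # same transition, but restarting B no longer touches A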
Re: [Linux-HA] update question
On Mon, Mar 28, 2011 at 9:08 PM, Miles Fidelman mfidel...@meetinghouse.net wrote: Hi Folks, I'm getting ready to upgrade a 2-node HA cluster from Debian Etch to Squeeze. I'd very much appreciate any suggestions regarding gotchas to avoid, and so forth. Basic current configuration: - 2 vanilla Intel-based servers, 4 SATA drives each - each machine: disks set up as 4-drive md10, LVM - xen 3.2 hypervisor, Debian Etch Dom0 - DRBD 8.2, Pacemaker linking the two nodes - several Debian Etch DomUs Target configuration: - update to xen 4.1, Debian Squeeze Dom0 - update to latest DRBD, Pacemaker - update DomUs on a case-by-case basis The most obvious question: Are later versions of Xen, DRBD, Pacemaker compatible with the older ones? I.e., can I take the simple approach: - migrate all DomUs to one machine - take the other machine off-line - upgrade Debian, Xen, DRBD, Pacemaker on the off-line node - bring that node back on-line (WILL THE NEW VERSIONS OF XEN, DRBD, PACEMAKER SYNC WITH THE PREVIOUS RELEASES ON THE OTHER NODE???) I can only comment on Pacemaker: I think so, but you haven't indicated which versions - migrate stuff to the updated node - take the 2nd node off-line, update everything, bring it back up, resync 1. Will that work? (If not: Alternative suggestions?) 2. Anything to watch out for? 3. As an alternative, does it make sense to install Ganeti and use it to manage the cluster? If so, any suggestions on a migration path? (Yes, it would be easier if I had one or two additional servers to use for intermediate staging, but such is life.) Thanks Very Much, Miles Fidelman -- In theory, there is no difference between theory and practice. Infnord practice, there is. Yogi Berra ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
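As far as the Pacemaker side of a node-by-node upgrade goes, the procedure described above can be driven from the crm shell. A sketch with hypothetical node names, assuming resource agents handle the DomU migration:

    crm node standby node2    # resources (and DomUs) move to node1
    # ... upgrade Debian, Xen, DRBD and Pacemaker on node2, reboot ...
    crm node online node2     # let it rejoin and resync before repeating on node1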
Re: [Linux-HA] DRBD and pacemaker interaction
On Fri, Apr 1, 2011 at 4:38 PM, Lars Ellenberg lars.ellenb...@linbit.com wrote: On Fri, Apr 01, 2011 at 11:35:19AM +0200, Christoph Bartoschek wrote: Am 01.04.2011 11:27, schrieb Florian Haas: On 2011-04-01 10:49, Christoph Bartoschek wrote: Am 01.04.2011 10:27, schrieb Andrew Beekhof: On Sat, Mar 26, 2011 at 12:10 AM, Lars Ellenberg lars.ellenb...@linbit.com wrote: On Fri, Mar 25, 2011 at 06:18:07PM +0100, Christoph Bartoschek wrote: I am missing the state: running degraded or suboptimal. Yep, degraded is not a state available for pacemaker. Pacemaker cannot do much about suboptimal. I wonder what it would take to change that. I suspect either a crystal ball or way too much knowledge of drbd internals. The RA would be responsible to check this. For drbd any diskstate different from UpToDate/UpToDate is suboptimal. Have you actually looked at the resource agent? It does already evaluate the disk state and adjusts the master preference accordingly. What else is there to do? Maybe I misunderstood Andrew's comment. I read it this way: If we introduce a new state suboptimal, would it be hard to detect it? No, detecting is the easy part. I just wanted to express that detecting suboptimality seems not to be that hard. But that state is useless for pacemaker, since it cannot do anything about it. This was the part I was wondering about - if pacemaker _could_ do something intelligent. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
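The mechanism behind this is the master preference attribute: the agent's monitor action calls crm_master with a higher or lower score depending on the disk state, and Pacemaker picks the promotion target accordingly. A simplified sketch of the idea, with illustrative scores and variable names rather than the actual linbit agent code:

    # inside the agent's monitor action
    if [ "$local_disk_state" = "UpToDate" ]; then
        crm_master -Q -l reboot -v 10000    # healthy: strongly prefer promotion here
    else
        crm_master -Q -l reboot -v 10       # degraded: still promotable, but last choice
    fi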
Re: [Linux-HA] Sort of crm commandes but off line ?
On Thu, Mar 24, 2011 at 2:32 PM, Alain.Moulle alain.mou...@bull.net wrote: Hi, Ok I think my question was not clear : in fact, the pb is not to do or not ssh node crm ... , the pb is just to know the hostname of the node to ssh it , in another way than parsing the cib.xml to know which other nodes are in the same HA cluster as the node where I am (knowing that corosync is stopped on this local node) . Add a floating IP with no quorum requirement (so that its always running as long as at least one node is) and set up an A record pointing clusterX.bull.net to it? Thanks Regards. Alain This might sound obvious but is an ssh call acceptable? On 3/23/2011 8:38 AM, Alain.Moulle wrote: Hi, I'm looking for a command which will give to me information of the HA cluster , such as for example all nodes hostnames which are in the same HA cluster BUT from a node where Pacemaker is not active. For example: I have a cluster with node1 , node2, node3 Pacemaker is running on node2 node3 Pacemaker is not running on node1 , so any crm command returns Signon to CIB failed: connection failed Init failed, could not perform requested operations I'm on node1 : I want to know (by script) if Pacemaker is active on at least another node in the HA cluster including the node where I am (so node1) Is there a command which could give me such information offline , or do I have to scan the uname fields in the recordnodes /nodes and to ssh on other nodes to get information ... Thanks Alain ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
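Until such a floating address exists, the node names can be pulled from the last on-disk copy of the CIB without any cluster daemons running. A rough sketch, assuming the default CIB location of that era (adjust the path if your packages put it elsewhere):

    # list the unames of all cluster nodes recorded in the local CIB copy
    grep -o 'uname="[^"]*"' /var/lib/heartbeat/crm/cib.xml | cut -d'"' -f2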
Re: [Linux-HA] comments required on location rules
should work. depends what stickiness value you're using On Sat, Mar 19, 2011 at 11:48 AM, Muhammad Sharfuddin m.sharfud...@nds.com.pk wrote: we have two resource groups 'grp-SAPDatabase' and 'grp-SAPInstances' To better utilize our both machines, I want to run the 'grp-SAPDatabase' and 'grp-SAPInstances' resources on different nodes, so that we can use both nodes simultaneously, otherwise if both resources run on single node, then the other node remain idle/passive. I wan that grp-SAPDatabase resource must always run on node1, and grp-SAPDatabase failover node is 'node2' while grp-SAPInstances resource must always run on node 'node2', and grp-SAPInstance failover node is 'node1' and for that I have created following location rules: location PrimaryLoc-of-grpSAPDatabase grp-SAPDatabase +inf: node1 location PrimaryLoc-of-grpSAPInstances grp-SAPInstances +inf: node2 please provide your comments/feedbacks on above rules. -- Regards, Muhammad Sharfuddin | NDS Technologies Pvt Ltd | cell: +92-333-2144823 | UAN: +92-21-111-111-142 ext: 113 The London Stock Exchange moves to SUSE Linux http://www.computerworlduk.com/news/open-source/3260727/london-stock-exchange-in-historic-linux-go-live/ http://www.zdnet.com/blog/open-source/the-london-stock-exchange-moves-to-novell-linux/8285 Your Linux is Ready http://www.novell.com/linux ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
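In other words: with +inf preferences the groups will still fail over when a node dies, but whether they fail back as soon as the preferred node returns depends on how resource-stickiness compares with the location score. If automatic fail-back is not wanted, the usual pattern is finite location scores lower than the stickiness; a sketch based on the rules above (score values are illustrative):

    location PrimaryLoc-of-grpSAPDatabase  grp-SAPDatabase  100: node1
    location PrimaryLoc-of-grpSAPInstances grp-SAPInstances 100: node2
    rsc_defaults resource-stickiness="200"    # stickiness > location score: no auto fail-back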
Re: [Linux-HA] need help to configure the fence_ifmib for stonith
On Thu, Mar 17, 2011 at 11:25 AM, Amit Jathar amit.jat...@alepo.com wrote: Hi, I would like to try the fence_ifmib as the fencing agent. I can see it is present in my machine. [root@OEL6_VIP_1 fence]# ls /usr/sbin/fence_ifmib /usr/sbin/fence_ifmib Also, I can see some python scripts present on my machine :- [root@OEL6_VIP_1 fence]# pwd /usr/share/fence [root@OEL6_VIP_1 fence]# ls fencing.py fencing.pyc fencing.pyo fencing_snmp.py fencing_snmp.pyc fencing_snmp.pyo [root@OEL6_VIP_1 fence]# Is there any chance I can configure the if_mib as the stonith agent. yes, but only if you have pacemaker 1.1.x If yes, then which MIB files shall I need ? no idea ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-ha-dev] new resource agents repository commit policy
On Mon, Mar 14, 2011 at 6:07 PM, Dejan Muhamedagic de...@suse.de wrote: Hello everybody, It's time to figure out how to maintain the new Resource Agents repository. Fabio and I already discussed this a bit in IRC. There are two options: a) everybody gets an account at github.com and commit rights, where everybody is all people who had commit rights to linux-ha.org and rgmanager agents repositories. b) several maintainers have commit rights and everybody else sends patches to a ML; then one of the maintainers does a review and commits the patch (or pulls it from the author's repository). I suspect you want b) with maybe 6 people for redundancy. The pull request workflow should be well suited to a project like this and impose minimal overhead. The ability to comment on patches in-line before merging them should be pretty handy. You're also welcome to put a copy at http://www.clusterlabs.org/git/ Its pretty easy to keep the two repos in sync, for example I have this in .git/config for matahari: [remote origin] fetch = +refs/heads/*:refs/remotes/origin/* url = g...@github.com:matahari/matahari.git pushurl = g...@github.com:matahari/matahari.git pushurl = ssh://beek...@git.fedorahosted.org/git/matahari.git git push then sends to both locations Option a) incurs a bit less overhead and that's how our old repositories worked. Option b) gives, at least nominally, more control to the select group of maintainers, but also places even more burden on them. We are open for either of these. Cheers, Fabio and Dejan ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/ ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
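The same two-pushurl arrangement can be set up without editing .git/config by hand; a sketch with a placeholder mirror URL, assuming a reasonably recent git:

    git remote set-url --add --push origin git@github.com:ClusterLabs/resource-agents.git
    git remote set-url --add --push origin ssh://git.example.org/srv/git/resource-agents.git
    git push    # now pushes to both locations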
Re: [Linux-HA] question on Creating an Active/Passive iSCSI configuration
On Fri, Mar 11, 2011 at 7:23 PM, Randy Katz rk...@simplicityhosting.com wrote: On 3/11/2011 3:29 AM, Dejan Muhamedagic wrote: Hi, On Fri, Mar 11, 2011 at 01:36:25AM -0800, Randy Katz wrote: On 3/11/2011 12:50 AM, RaSca wrote: Il giorno Ven 11 Mar 2011 07:32:32 CET, Randy Katz ha scritto: ps - in /var/log/messages I find this: Mar 10 22:31:45 drbd1 lrmd: [3274]: ERROR: get_resource_meta: pclose failed: Interrupted system call Mar 10 22:31:45 drbd1 lrmd: [3274]: WARN: on_msg_get_metadata: empty metadata for ocf::linbit::drbd. Mar 10 22:31:45 drbd1 lrmadmin: [3481]: ERROR: lrm_get_rsc_type_metadata(578): got a return code HA_FAIL from a reply message of rmetadata with function get_ret_from_msg. [...] Hi, I think that the message no such resource agent is explaining what's the matter. Does the file /usr/lib/ocf/resource.d/linbit/drbd exists? Is the drbd file executable? Have you correctly installed the drbd packages? Check those things, you can try to reinstall drbd. Hi # ls -l /usr/lib/ocf/resource.d/linbit/drbd -rwxr-xr-x 1 root root 24523 Jun 4 2010 /usr/lib/ocf/resource.d/linbit/drbd Which cluster-glue version do you run? Try also: # lrmadmin -C # lrmadmin -P ocf drbd # export OCF_ROOT=/usr/lib/ocf # /usr/lib/ocf/resource.d/linbit/drbd meta-data I am running from a source build/install as per clusterlabs.org as the rpm's had broken dependencies and would not install. Its a good idea to report that, with details, so that it can get fixed. I have now blown away that CentOS (one of them) machine and installed openSUSE as they said everything was included but it seems on 11.3 not on 11.4, on 11.4 the install is broken and so now running some later later versions and running into some other issues, will report back with findings. What os distro is the least of the problems to get this stuff running on? I just want to get it running, run a few tests, and then figure out where to go from there. Thanks, Randy ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] GFS2 mounting hangs
On Thu, Mar 10, 2011 at 5:53 PM, Jonathan Schaeffer jonathan.schaef...@univ-brest.fr wrote: Hi, I'm trying to setup a pacemaker cluster based on DRBD Active/Active and GFS2. Everything is working fine on normal startup. But when I try to mess around with the cluster, I come across unrecoverable problems with the GFS2 partition mounting. Here is what I did and what happens : - Remove the network link between the two nodes. - Show how the cluster behaves for a while - Get the network interface up again - As one machine whas stonithed by the other (meatware for the tests), I restarted the node. Did you run the meatware confirmation command too? - on reboot, the cluste can't get the FileSystem resource up and hit timeout. This is what I did to show details of the mounting operation : # strace /sbin/mount.gfs2 /dev/drbd0 /data -o rw ... socket(PF_FILE, SOCK_STREAM, 0) = 3 connect(3, {sa_family=AF_FILE, path=@gfsc_sock}, 12) = 0 write(3, \\o\\o\1\0\1\0\7\0\0\0\0\0\0\0`p\0\0\0\0\0\0\0\0\0\0\0\0\0\0..., 28768) = 28768 read(3, I suspect there is a problem with the DLM holding one more lock than necessary. The GFS partition was created with 2 journals (and has to run on 2 nodes). Does someone rely on such setup for a prodution use ? Realy ? If so, can you help me debug my problem ? The pacemaker config is pretty much as in the docs (DRBD+GFS2). In case it matters, the config is shown below. Thank you ! node orque \ attributes standby=false node orque2 \ attributes standby=off primitive drbd-data ocf:linbit:drbd \ params drbd_resource=orque-raid \ op start interval=0 timeout=240s start-delay=5s \ op stop interval=0 timeout=100s \ op monitor interval=30s timeout=30s start-delay=5s primitive dlm ocf:pacemaker:controld \ op monitor interval=120s \ op start interval=0 timeout=90s \ op stop interval=0 timeout=100s primitive gfs-control ocf:pacemaker:controld \ params daemon=gfs_controld.pcmk args=-g 0 \ op monitor interval=120s \ op start interval=0 timeout=90s \ op stop interval=0 timeout=100s primitive orque-fs ocf:heartbeat:Filesystem \ params device=/dev/drbd/by-res/orque-raid directory=/data fstype=gfs2 \ op start interval=0 timeout=60s \ op stop interval=0 timeout=60s primitive kvm-adonga ocf:heartbeat:VirtualDomain \ params config=/etc/libvirt/qemu/adonga.xml hypervisor=qemu:///system migration_transport=ssh \ meta allow-migrate=true target-role=Started is-managed=true \ op start interval=0 timeout=200s \ op stop interval=0 timeout=200s \ op monitor interval=10 timeout=200s on-fail=restart depth=0 primitive kvm-observatoire-test ocf:heartbeat:VirtualDomain \ params config=/etc/libvirt/qemu/observatoire-test.xml hypervisor=qemu:///system migration_transport=ssh \ meta allow-migrate=true target-role=Started is-managed=true \ op start interval=0 timeout=200s \ op stop interval=0 timeout=200s \ op monitor interval=10 timeout=200s on-fail=restart depth=0 primitive kvm-testVM ocf:heartbeat:VirtualDomain \ params config=/etc/libvirt/qemu/testVM.xml hypervisor=qemu:///system migration_transport=ssh \ meta allow-migrate=true target-role=Stopped is-managed=true \ op start interval=0 timeout=200s \ op stop interval=0 timeout=200s \ op monitor interval=10 timeout=200s on-fail=restart depth=0 primitive orque-fencing stonith:meatware \ params hostlist=orque \ meta is-managed=true primitive orque2-fencing stonith:meatware \ params hostlist=orque2 \ meta is-managed=true target-role=Started ms drbd-data-clone drbd-data \ meta master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 notify=true clone dlm-clone 
dlm \ meta interleave=true target-role=Started clone gfs-clone gfs-control \ meta interleave=true target-role=Started clone orque-fs-clone orque-fs \ meta is-managed=true target-role=Started interleave=true ordered=true location kvm-testVM-prefers-orque kvm-testVM 50: orque location loc-orque-fencing orque-fencing -inf: orque location loc-orque2-fencing orque2-fencing -inf: orque2 colocation gfs-with-dlm inf: gfs-clone dlm-clone colocation kvm-adonga-with-orque-fs inf: kvm-adonga orque-fs-clone colocation kvm-observatoire-test-with-orque-fs inf: kvm-observatoire-test orque-fs-clone colocation kvm-testVM-with-orque-fs inf: kvm-testVM orque-fs-clone colocation orque-fs-with-gfs-control inf: orque-fs-clone gfs-clone order gfs-after-dlm inf: dlm-clone gfs-clone order kvm-adonga-after-orque-fs inf: orque-fs-clone kvm-adonga order kvm-observatoire-test-after-orque-fs inf: orque-fs-clone kvm-observatoire-test order kvm-testVM-after-orque-fs inf: orque-fs-clone kvm-testVM order orque-fs-after-drbd-data inf:
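Regarding the meatware question in the reply: the operator has to acknowledge the manually verified power-off before the surviving node treats the fencing as complete. A sketch, assuming the meatclient helper shipped with cluster-glue:

    # run on the node whose log asks for operator intervention,
    # after physically confirming that the peer really is down
    meatclient -c orque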
Re: [Linux-HA] Active/active cluster with connectivity check
On Thu, Mar 10, 2011 at 3:54 PM, Artur linu...@netdirect.fr wrote: Hello, I'm currently switching to Heartbeat (3.0.3) and Pacemaker (1.0.9.1) on Debian Squeeze with CRM/CIB setup. This is the first time i try to configure it so please be kind with a newbie. :) I would like to setup an active/active cluster with 2 nodes, with sticky resources and a connectivity check. At this time i have a basic working solution without connectivity check. The configuration follows at the bottom of the email. But i am unable to configure the connectivity check with the pingd resource. The (basic) configuration explained : - on node p01 there are 2 sticky resources called WWW1 (apache web server) and VIP1 (virtual IP) - on node p17 there is 1 sticky resource called VIP2 (virtual IP) In my mind this is how it should work : - if a node is down its sticky resources go to the other node (this works) - if a node has no connectivity the resources go to the other node (unable to make it work) - if a down node goes up, the sticky resources are migrated back on it (this works in the current setup with no connectivity check) I created a pingd primitive and cloned it as explained in tutorials and tried some rules but with no success. I tried to add the following rules with no success : location www1-on-p01-connected WWW1 \ rule $id=www1-on-p01-connected-rule pingd: defined pingd \ rule $id=www1-on-p01-connected-rule-0 -1000: not_defined pingd or pingd lte 0 \ change -1000 to -INFINITY rule $id=www1-on-p01-connected-rule-1 1000: #uname eq p01 location vip2-on-p17-connected VIP2 \ rule $id=vip2-on-p17-connected-rule pingd: defined pingd \ rule $id=vip2-on-p17-connected-rule-0 -1000: not_defined pingd or pingd lte 0 \ here too rule $id=vip2-on-p17-connected-rule-1 1000: #uname eq p17 This is the current setup without active connectivity check but with cloned pingd primitive : node $id=52dadc12-ada6-46a0-8474-639a62dfa3ad p17 node $id=6d96beed-abd9-4ad1-9a92-b6560abc0475 p01 primitive VIP1 ocf:heartbeat:IPaddr2 \ params ip=192.168.1.201 cidr_netmask=32 iflabel=vip1 \ op monitor interval=30s primitive VIP2 ocf:heartbeat:IPaddr2 \ params ip=192.168.1.202 cidr_netmask=32 iflabel=vip2 \ op monitor interval=30s primitive WWW1 ocf:heartbeat:apache \ params configfile=/etc/apache2/apache2.conf primitive pingd ocf:pacemaker:pingd \ params host_list=192.168.1.254 multiplier=1000 \ op monitor interval=15s timeout=5s clone pingdclone pingd \ meta globally-unique=false location vip2-on-p17 VIP2 250: p17 location www1-on-p01 WWW1 250: p01 colocation www1-with-vip1 inf: WWW1 VIP1 order www1-after-vip1 inf: VIP1 WWW1 property $id=cib-bootstrap-options \ stonith-enabled=false \ no-quorum-policy=ignore \ dc-version=1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b \ cluster-infrastructure=Heartbeat Any ideas about how to make it work ? -- Best regards, Artur. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
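One arrangement that usually works is to keep the existing preference rules and add separate -INFINITY rules that only bite when the connectivity attribute is missing or zero, which is essentially the change suggested inline above. A sketch using the attribute name pingd that the clone publishes by default:

    location www1-on-p01-connected WWW1 \
          rule -inf: not_defined pingd or pingd lte 0
    location vip2-on-p17-connected VIP2 \
          rule -inf: not_defined pingd or pingd lte 0

The pingdclone must of course be running on both nodes, otherwise the attribute is never defined and the resources cannot run anywhere.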
Re: [Linux-HA] resource not restarted due to score value
Happily this appears to be fixed in 1.1.5 (which I believe should be available for SLES soon). On Mon, Feb 7, 2011 at 9:17 AM, Haussecker, Armin armin.haussec...@ts.fujitsu.com wrote: Hi, we have sles11 sp1 with pacemaker 1.1.2-0.7.1 and corosync 1.2.6-0.2.2. Attached please find cibadmin -Ql before stopping StorGr1 on node goat1 (diag.before) cibadmin -Ql after stopping StorGr1 on node goat1 (diag.after) crm_mon after stopping StorGr1 on node goat1 (diag.crm_mon) ptest -sL after stopping StorGr1 on node goat1 (diag.ptest) Regards, Armin Haussecker -Original Message- From: linux-ha-boun...@lists.linux-ha.org [mailto:linux-ha-boun...@lists.linux-ha.org] On Behalf Of Andrew Beekhof Sent: Monday, February 07, 2011 8:46 AM To: General Linux-HA mailing list Subject: Re: [Linux-HA] resource not restarted due to score value On Fri, Feb 4, 2011 at 12:02 PM, Haussecker, Armin armin.haussec...@ts.fujitsu.com wrote: Hi, in our 2-node-cluster we have a clone resource StorGr1 and two primitive resources DummyVM1 and DummyVM2. StorGr1 should be started before DummyVM1 and DummyVM2 due to order constraints. StorGr1 clone was started on both cluster nodes goat1 and sheep1. DummyVM1 and DummyVM2 were both started on node goat1. Then we stopped StorGr1 on node goat1. We expected a restart of DummyVM1 and DummyVM2 on the second node sheep1 due to the order constraints. But only DummyVM2 was restarted on the second node sheep1. DummyVM1 was stopped and remained in the stopped state: Clone Set: StorGr1-clone [StorGr1] Started: [ sheep1 ] Stopped: [ StorGr1:1 ] DummyVM1 (ocf::pacemaker:Dummy): Stopped DummyVM2 (ocf::pacemaker:Dummy): Started sheep1 Difference: DummyVM1 has a higher allocation score value for goat1 and DummyVM2 has a higher allocation score value for sheep1. How can we achieve a restart of the primitive resources independently of the allocation score value ? Do we need other or additional constraints ? Shouldn't need to. Please attach the result of cibadmin -Ql when the cluster is in this state. Also some indication of what version you're running would be helpful. 
Best regards, Armin Haussecker Extract from CIB: primitive DummyVM1 ocf:pacemaker:Dummy \ op monitor interval=60s timeout=60s \ op start on-fail=restart interval=0 \ op stop on-fail=ignore interval=0 \ meta is-managed=true resource-stickiness=1000 migration-threshold=2 primitive DummyVM2 ocf:pacemaker:Dummy \ op monitor interval=60s timeout=60s \ op start on-fail=restart interval=0 \ op stop on-fail=ignore interval=0 \ meta is-managed=true resource-stickiness=1000 migration-threshold=2 primitive StorGr1 ocf:heartbeat:Dummy \ op monitor on-fail=restart interval=60s \ op start on-fail=restart interval=0 \ op stop on-fail=ignore interval=0 \ meta is-managed=true resource-stickiness=1000 migration-threshold=2 clone StorGr1-clone StorGr1 \ meta target-role=Started interleave=true ordered=true location score-DummyVM1 DummyVM1 400: goat1 location score-DummyVM2 DummyVM2 400: sheep1 order start-DummyVM1-after-StorGr1-clone inf: StorGr1-clone DummyVM1 order start-DummyVM2-after-StorGr1-clone inf: StorGr1-clone DummyVM2 ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Not able to stop services in group individually.
On Wed, Mar 2, 2011 at 9:31 AM, Caspar Smit c.s...@truebit.nl wrote: Hi, I have the following (simple) configuration: primitive iscsi0 ocf:heartbeat:iscsi \ params portal=*172.20.250.5 *target=iqn.2010-10.nl.nas:nas.storage0 primitive iscsi1 ocf:heartbeat:iscsi \ params portal=*172.20.250.21 *target=iqn.2010-10.nl.nas:nas.storage1 primitive failover-ip0 ocf:heartbeat:IPaddr2 \ params ip=172.20.60.13 iflabel=0 primitive lvm0 ocf:heartbeat:LVM \ params volgrpname=vg0 exclusive=yes primitive filesystem0 ocf:heartbeat:Filesystem \ params device=/dev/vg0/lv0 directory=*/mnt/storage* fstype=xfs primitive filesystem1 ocf:heartbeat:Filesystem \ params device=/dev/vg0/lv1 directory=*/mnt/storage2* fstype=xfs primitive nfs-server lsb:nfs-kernel-server primitive samba-server lsb:samba group nfs-and-samba-group iscsi0 iscsi1 failover-ip0 lvm0 filesystem0 filesystem1 nfs-server samba-server location nfs-and-samba-group-prefer-node01 nfs-and-samba-group 100: node01 So two iscsi initiators, then LVM on top of those, two filesystems (one for nfs exports and one for a samba share). What I noticed is that when I want to only stop the nfs-server (for doing some maintenance for instance) the samba-server is stopped also (because it is in a group and the order in the group seems like every primitive is required for the next primitive) Right, thats how groups are supposed to work. (pacemaker reads the group from left to right) How would I be able to stop nfs-server and/or samba-server without interupting anything else in the group? set is-managed=false for the group perhaps? Should I split those two from the group? But then I would need more constraints telling that the nfs-server and samba server can only start at the node were the iscsi initiator/LVM is up. And when I want to migrate the nfs-server to the other node, the samba-server and iscsi/LVM need to migrate also because of the large vg0? Can anyone tell me how to accomplish this? Thanks you very much in advance, Caspar Smit ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
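For one-off maintenance the simplest route is the one hinted at in the reply: take the resource (or the whole group) out of Pacemaker's control, do the work by hand, then hand it back. A sketch using the names from this configuration:

    crm resource unmanage nfs-server       # pacemaker stops acting on it
    /etc/init.d/nfs-kernel-server stop     # maintenance by hand
    /etc/init.d/nfs-kernel-server start
    crm resource manage nfs-server
    crm resource cleanup nfs-server        # clear any monitor failures recorded meanwhile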
Re: [Linux-HA] Load CRM-Konfiguration from file
try: crm configure < filename On Wed, Mar 9, 2011 at 4:34 PM, Stallmann, Andreas astallm...@conet.de wrote: Hi there, is it possible to exchange a complete CIB with another CIB? The background is that we have to roll out the same cluster in different customer environments with different IPs / networks. Instead of manipulating the CIB by hand via CRM, I'd rather replace placeholders in a template cib via a script. I tried crm -f filename and crm filename to no avail. crm then commits the changes line-by-line immediately, which can lead to undesirable side effects (because some primitives start at once, where I actually wanted the group to start instead etc.). Can one force crm into a batch mode, where commit happens only when I want it to happen? Could I instead exchange the CIB-XML-file? If yes, which prerequisites do I have to take care of (I guess the cluster should be stopped, including corosync, right?)? And how do I generate a CIB-file without the status-information of a running system? Would you be so kind to point me to the right source of information (yes, that's a request for a RTFM *grin*). Thanks in advance, Andreas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
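Two concrete ways to do this, sketched with placeholder file names: feed a whole configure session to the shell so everything is committed in one transaction, or swap in a prepared configuration section (and only that section, so no status information is involved) with cibadmin:

    # cluster.crm contains configure-level statements and ends with "commit"
    crm configure < cluster.crm

    # or: replace just the configuration section of the live CIB from an XML template
    cibadmin --replace --obj_type configuration --xml-file new-config.xml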
Re: [Linux-HA] Ping resource goes down and never comes up
On Thu, Feb 17, 2011 at 10:00 PM, RaSca ra...@miamammausalinux.org wrote: Hi all, is it possible that a ping_clone goes down on a node because there is no connectivity and never comes up again when the connectivity returns? The ping and clone resource is declared like this: primitive ping ocf:pacemaker:ping \ params host_list=192.168.100.1 name=ping \ op monitor interval=10s timeout=60s \ op start interval=0 timeout=60s \ op stop interval=0 timeout=60s clone ping_clone ping meta globally-unique=false I had to force a cleanup on this resource to make it up again. Also if there are some resources connected by a location like this: location tomcat_on_connected_node tomcat_clone \ rule $id=tomcat_on_connected_node-rule -inf: not_defined ping or ping lte 0 to the ping status those went down when the ping dies and obviously never comes up again when the connection returns. we're not running ping as a daemon, its spawned every time monitor is called. hard to say much without logs Are there some parameters to force the cleanup or to manage these kind of situations? Thanks a lot! -- RaSca Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene! ra...@miamammausalinux.org http://www.miamammausalinux.org ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
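Independent of the cleanup question, the ping agent has a dampen parameter that delays attribute changes and an attempts parameter that tolerates the odd lost packet, which makes flapping much less likely when connectivity comes and goes. A sketch based on the primitive above (values are illustrative):

    primitive ping ocf:pacemaker:ping \
          params host_list="192.168.100.1" name="ping" \
                 dampen="30s" attempts="3" multiplier="1000" \
          op monitor interval="10s" timeout="60s"
    clone ping_clone ping meta globally-unique="false"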
Re: [Linux-HA] after configuring dlm resource, pacemaker on cman stack fails
Check the correct daemon is being started (looks like its still starting the pacemaker specific one). Check what happens when you start the daemon manually. On Fri, Feb 25, 2011 at 9:10 AM, Pieter Baele pieter.ba...@gmail.com wrote: I try to get clvmd (+cmirror) running on top of pacemaker - cman. After the initial setup, I defined a dlm resource primitive dlm ocf:pacemaker:controld op monitor interval=60 timeout=60 Maybe this is the wrong way or I missed a step? What else is required? (no instructions in previous [Pacemaker] CMAN integration questions) Feb 25 08:55:45 node01 crmd: [5859]: info: update_dc: Unset DC node02 Feb 25 08:55:45 node01 crmd: [5859]: info: do_state_transition: State transition S_NOT_DC - S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ] Feb 25 08:55:45 node01 crmd: [5859]: info: update_dc: Set DC to 1node02 (3.0.2) Feb 25 08:55:45 node01 cib: [5855]: info: write_cib_contents: Archived previous version as /var/lib/heartbeat/crm/cib-38.raw Feb 25 08:55:45 node01 cib: [5855]: info: write_cib_contents: Wrote version 0.24.0 of the CIB to disk (digest: 4bcc6b0560ade75509e811d2cb89e3fa) Feb 25 08:55:45 node01 cib: [5855]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.dZXxI6 (digest: /var/lib/heartbeat/crm/cib.KnMxRG) Feb 25 08:55:45 node01 crmd: [5859]: info: do_state_transition: State transition S_PENDING - S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ] Feb 25 08:55:45 node01 attrd: [5857]: info: attrd_local_callback: Sending full refresh (origin=crmd) Feb 25 08:55:45 node01 attrd: [5857]: info: attrd_trigger_update: Sending flush op to all hosts for: shutdown (null) Feb 25 08:55:45 node01 attrd: [5857]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-dlm (null) Feb 25 08:55:45 node01 attrd: [5857]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-ClusteredIP (null) Feb 25 08:55:45 node01 attrd: [5857]: info: attrd_trigger_update: Sending flush op to all hosts for: terminate (null) Feb 25 08:55:45 node01 attrd: [5857]: info: attrd_trigger_update: Sending flush op to all hosts for: last-failure-dlm (null) Feb 25 08:55:45 node01 attrd: [5857]: info: attrd_trigger_update: Sending flush op to all hosts for: last-failure-ClusteredIP (null) Feb 25 08:55:45 node01 attrd: [5857]: info: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true) Feb 25 08:55:46 node01 crmd: [5859]: info: do_lrm_rsc_op: Performing key=9:13:0:fb5708b9-4afe-41d3-b3f1-b9a47a7f29c6 op=dlm_start_0 ) Feb 25 08:55:46 node01 lrmd: [5856]: info: rsc:dlm:6: start Feb 25 08:55:46 node01 lrmd: [5856]: info: RA output: (dlm:start:stderr) dlm_controld.pcmk: no process killed Feb 25 08:55:46 node01 cluster-dlm: [9561]: info: get_cluster_type: Cluster type is: 'cman'. 
Feb 25 08:55:46 node01 cluster-dlm: [9561]: info: get_local_node_name: Using CMAN node name: node01 Feb 25 08:55:46 node01 cluster-dlm: [9561]: info: init_ais_connection_once: Connection to 'cman': established Feb 25 08:55:46 node01 cluster-dlm: [9561]: info: crm_new_peer: Node node01 now has id: 16847020 Feb 25 08:55:46 node01 cluster-dlm: [9561]: info: crm_new_peer: Node 16847020 is now known as node01 Feb 25 08:55:46 node01 cluster-dlm: [9561]: ERROR: crm_abort: send_ais_text: Forked child 9565 to record non-fatal assert at ais.c:345 : dest != crm_msg_ais Feb 25 08:55:46 node01 cluster-dlm: [9561]: ERROR: send_ais_text: Sending message 0 via cpg: FAILED (rc=22): Message error: Success (0) Feb 25 08:55:46 node01 cluster-dlm: [9561]: ERROR: crm_abort: send_ais_text: Forked child 9566 to record non-fatal assert at ais.c:345 : dest != crm_msg_ais Feb 25 08:55:46 node01 cluster-dlm: [9561]: ERROR: send_ais_text: Sending message 1 via cpg: FAILED (rc=22): Message error: Success (0) Feb 25 08:55:46 node01 dlm_controld.pcmk: [9561]: notice: terminate_ais_connection: Disconnecting from AIS Feb 25 08:55:47 node01 lrmd: [5856]: info: RA output: (dlm:start:stderr) dlm_controld.pcmk: no process killed Feb 25 08:55:47 node01 crmd: [5859]: info: process_lrm_event: LRM operation dlm_start_0 (call=6, rc=7, cib-update=17, confirmed=true) not running Feb 25 08:55:47 node01 attrd: [5857]: info: attrd_ais_dispatch: Update relayed from node02 Feb 25 08:55:47 node01 attrd: [5857]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-dlm (INFINITY) Feb 25 08:55:47 node01 crmd: [5859]: info: do_lrm_rsc_op: Performing key=2:14:0:fb5708b9-4afe-41d3-b3f1-b9a47a7f29c6 op=dlm_stop_0 ) Feb 25 08:55:47 node01 lrmd: [5856]: info: rsc:dlm:7: stop Feb 25 08:55:47 node01 attrd: [5857]: info: attrd_perform_update: Sent update 81: fail-count-dlm=INFINITY Feb 25 08:55:47 node01 attrd: [5857]: info: attrd_ais_dispatch: Update relayed from node02 Feb 25 08:55:47 node01 attrd: [5857]: info: attrd_trigger_update: Sending
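Following the reply: on the cman stack the dlm daemon from the cluster suite should end up running, not dlm_controld.pcmk (which is what the log shows being exec'd). A sketch of what to check and, assuming the controld agent's daemon parameter is the right knob on this setup, how the primitive might look:

    # which flavour is actually running?
    ps ax | grep dlm_controld

    primitive dlm ocf:pacemaker:controld \
          params daemon="dlm_controld" \
          op monitor interval="60" timeout="60"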
Re: [Linux-HA] Looking for a suitable Stonith Solution
On Wed, Mar 2, 2011 at 9:05 AM, Stallmann, Andreas astallm...@conet.de wrote: Hi Andrew, If suicide is no supported fencing option, why is it still included with stonith? Left over from heartbeat v1 days I guess. Could also be a testing-only device like ssh. www.clusterlabs.org tells me, you're the Pacemaker project leader. Yes, but the stonith devices come from cluster-glue. So I guess Dejan or Florian are nominally in charge of those, but they've not been changed in forever. Would you, by chance, know who maintains or maintained the suicide-stonith-plugin? It maybe testing-only, yes. But at least, ssh is working as intended. It's badly documented, and I didn't find a single (official) document on howto implement a (stable!) suicide-stonith, Because you can't. Suicide is not, will not, can not be reliable. Yes, you're right. But under certain circumstances (1. nodes are still alive, 2. both redundant communication channels [networks] are down, 3. policy requires no node to be up, which has no quorum) it might be a good addition to a regular stonith (because if [2] happens, pacemaker/stonith will probably not be able to control a network power switch etc.) Could we agree on that? Sure. But even if you have a functioning suicide plugin, Pacemaker cannot ever make decisions that assume it worked. Because for all it knows the other side might consider itself to be perfectly healthy. If not: What's your recommended setup for (resp. against) such situations? Think of split sites here! You still need reliable fencing, if you cant provide that, there needs to be a human in the loop. The whole point of stonith is to create a known node state (off) in situations where you cannot be sure if your peer is alive, dead or some state in-between. Yes, so don't file suicide under stonith! We implemented a different approach in a two node cluster: We wrote a script that checks (by means of cron) the connectivity (by means of ping) to the peer (if connected, everything fine) and then (if peer are not reachable) to some quorum nodes. If either the peer or a majority of the quorum nodes are alive, nothing happens. If quorum is lost, the node shut's itself down. Wonderful, but the healthy side still can't do anything, because it can't know that the bad side is down. So what have you gained over no-quorum-policy=stop (which is the default) ? We did that, because drbd tended to misbehave in situations, where all network connectivity was lost. We'd rather have a clean shutdown on both sides, than a corrupt filesystem. I always consider this solution as unelegant, mainly because it wasn't controllable via crm. Thus I hoped, I could forget this solution when using pacemaker. It seems, I can not. If there's any interest from the community in our suicide by cron-solution, tell me if and how to contribute. It requires a sick node to suddenly start functioning correctly - so attempting to self-terminate makes some sense, relying on it to succeed does not seem prudent. Ys! But it's not always the node, that's sick. Sometimes (even with the best and most redundant network), the connectivity between the node ist the problem, not a marauding pacemaker or openais! Again: Please tell me, what's your solution in that case? Again, tell me how the other side is supposed to know and what you gain? On the other hand, it doen't make any other sense to name a no-quorum-policy suicide, if it's anything, but a suicide (if, at all, one could name it assisted suicide). This question is still unanswered. 
Does no quorum-policy suicide really have a meaning? yes, for N 2, it is a faster version of stop Or is it as well a leftover from the times of heartbeat. no Is it still functional? yes ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Looking for a suitable Stonith Solution
On Fri, Feb 25, 2011 at 12:51 PM, Stallmann, Andreas astallm...@conet.de wrote: Hi! I conentrate both your answers into one mail, I hope that's allright for you. For now, I need an interim solution, which is, as of now, stonith via suicide. Doesn't work as suicide is not considered reliable - by definition the remaining nodes have no way to verify that the fencing operation was successful. Suspect it will still fail though, suicide isnt a supported fencing option - since obviously the other nodes can't confirm it happened. Ok then, I know I'm a little bit provocative right now: If suicide is no supported fencing option, why is it still included with stonith? Left over from heartbeat v1 days I guess. Could also be a testing-only device like ssh. It's badly documented, and I didn't find a single (official) document on howto implement a (stable!) suicide-stonith, Because you can't. Suicide is not, will not, can not be reliable. The whole point of stonith is to create a known node state (off) in situations where you cannot be sure if your peer is alive, dead or some state in-between. Suicide does not achieve this in any way, shape or form. It requires a sick node to suddenly start functioning correctly - so attempting to self-terminate makes some sense, relying on it to succeed does not seem prudent. but it's there, and thus it should be usable. If it isn't, the maintainer should please (please!) remove it or supply something that's working. I do know, that's quite demanding, because the maintainer will probably do the development in his (or her) free time. Still... I do as well agree, that suicide is a very special way of keeping a cluster consistent, very different from the other stonith methods. I wouldn't expect it under stonith, I'd rather think... Yes no-quorum-policy=suicide means that all nodes in the partition will end up being shot, but you still require a real stonith device so that _someone_else_ can perform it. ...that if you set no-quorum-policy=suicide, the suicide script is executed by the node itself. It should be an *extra* feature *besides* stonith. The procedure should be something like: 1) node1: Allright, I have no quorum anymore. Let's wait for a while... 2)... a while passes 3) node1: OK, I'm still without quorum, no contact to my peers, whatsoever. I'd rather shut myself down, before I cause a mess. If, during (2), the other nodes find a way to shut down the node externaly (if through ssh, a power switch, a virtualisation host...), that's even better, because then the cluster knows, that it's still consistent. I'm with you, here. If a split brain happens in a split site scenario, a suicide might be the only way to keep up consistency, because no one will be able to reach any device on the other site... Please correct me if I'm wrong. What do you do in such a case? What's your exemplary implementation of Linux-HA then? On the other hand, it doen't make any other sense to name a no-quorum-policy suicide, if it's anything, but a suicide (if, at all, one could name it assisted suicide). Please correct me: Do I have a utterly wrong understanding of the whole process (that could be very well the case), is the implementation not entirely thought through, or is the naming of certain components not as good as it could be? I might point you to http://osdir.com/ml/linux.highavailability.devel/2007-11/msg00026.html, because the same thing has been discussed then, and I very much do think, that Lars was right with what he wrote. 
Has anything changed in the concept of suicide/quorum-loss/stonith since then? That's not a provocative question, well, maybe it is, but it's not meant to be. In addition: Something that's missing from the manuals is a case study (or something the like) on how to implement a split side scenario. How should the cluster be build then? If you have to sides? If you have one? How should the storage-replication be set up? Is synchronous replication like in drbd really a good idea then, performance wise? I think I'll finally have to buy a book. :-) Any recommendations (either english or german prefered). Well, thank's a lot again, my brain didn't explode (that's something good, I feel), but I'm not entirely happy, though. Cheers and have a nice weekend, Andreas CONET Solutions GmbH, Theodor-Heuss-Allee 19, 53773 Hennef. Registergericht/Registration Court: Amtsgericht Siegburg (HRB Nr. 9136) Geschäftsführer/Managing Directors: Jürgen Zender (Sprecher/Chairman), Anke Höfer Vorsitzender des Aufsichtsrates/Chairman of the Supervisory Board: Hans Jürgen Niemeier CONET Technologies AG, Theodor-Heuss-Allee 19, 53773 Hennef. Registergericht/Registration Court: Amtsgericht Siegburg (HRB Nr. 10328 ) Vorstand/Member of the Managementboard: Rüdiger Zeyen (Sprecher/Chairman), Wilfried Pütz Vorsitzender des Aufsichtsrates/Chairman
Re: [Linux-ha-dev] new resource agents repository
On Thu, Feb 24, 2011 at 4:10 PM, Dejan Muhamedagic deja...@fastmail.fm wrote: On Thu, Feb 24, 2011 at 03:56:27PM +0100, Andrew Beekhof wrote: On Thu, Feb 24, 2011 at 2:59 PM, Dejan Muhamedagic deja...@fastmail.fm wrote: Hello, There is a new repository for Resource Agents which contains RA sets from both Linux HA and Red Hat projects: git://github.com/ClusterLabs/resource-agents.git The purpose of the common repository is to share maintenance load and try to consolidate resource agents. There were no conflicts with the rgmanager RA set and both source layouts remain the same. It is only that autoconf bits were merged. The only difference is that if you want to get Linux HA set of resource agents installed, configure should be run like this: configure --with-ras-set=linux-ha ... The new repository is git but the existing history is preserved. People used to Mercurial shouldn't have hard time working with git. We need to retire the existing repository hg.linux-ha.org. Are there any objections or concerns that still need to be addressed? Might not hurt to leave it around - there might be various URLs that point there. Yes, it will definitely remain there. What I meant with retire, is that the developers then start using the git repository exclusively. A Yes, and making read-only on the server it probably a good idea (to avoid pushes). ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-HA] Problems starting apache
On Thu, Feb 24, 2011 at 3:31 PM, Dan Frincu df.clus...@gmail.com wrote: Hi, On 02/24/2011 04:24 PM, Stallmann, Andreas wrote: Hi! First: I set up my configuration anew, and it works. I didn't change that much, just set the monitor-action differently from before. Instead of: webserver_ressource ocf:heartbeat:apache \ params httpd=/usr/sbin/httpd2-prefork \ op start interval=0 timeout=40s \ op stop interval=0 timeout=60s \ op monitor interval=10 timeout=20s depth=0 \ meta target-role=Started I have now: primitive web_res ocf:heartbeat:apache \ params configfile=/etc/apache2/httpd.conf \ params httpd=/usr/sbin/httpd2-prefork \ op start interval=0 timeout=40s \ op stop interval=0 timeout=60s \ op monitor interval=1min As you can see, I added the configfile. This obviously did it, because when I gave the logs a closer look I found: Feb 24 10:43:22 mgmt-01 apache[1191]: ERROR: httpd2-prefork: option requires an argument -- f Somehow the ocf:heartbeat:apache did not supply the default-configfile. Thus, you have to supply the configfile. Well... The answer should be in the logs. Your right, that's where it was. I somehow got lost in the vast amount of logs... grep -i error | grep -i apache That did it. Is it, by the way possible, to influence the logging in any way? Make it more verbose, or redirect the logs to a different file (without using filtering in syslog)? If you're using corosync, then in /etc/corosync/corosync.conf. logging { debug: off fileline: off to_syslog: yes to_stderr: no syslog_facility: local7 timestamp: on to_logfile: yes logfile: /var/log/cluster/corosync.log logger_subsys { subsys: AMF debug: off } } And in syslog.conf you put local7.* /var/log/cluster/corosync.log And it logs to a file. For anything else other than corosync, someone else can reply. This will tell Pacemaker to also the same logging setup, so no additional steps there. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Looking for a suitable Stonith Solution
On Thu, Feb 24, 2011 at 2:49 PM, Stallmann, Andreas astallm...@conet.de wrote: Hi! TNX for your answer. We will switch to sbd after the shared storage has been set up. For now, I need an interim solution, which is, as of now, stonith via suicide. Doesn't work as suicide is not considered reliable - by definition the remaining nodes have no way to verify that the fencing operation was successful. Yes no-quorum-policy=suicide means that all nodes in the partition will end up being shot, but you still require a real stonith device so that _someone_else_ can perform it. My configuration doesn't work, though. I tried: ~~Output from crm configure show~~ primitive suicide_res stonith:suicide ... clone fenc_clon suicide_res ... property $id=cib-bootstrap-options \ dc-version=1.1.2-8b9ec9ccc5060457ac761dce1de719af86895b10 \ cluster-infrastructure=openais \ expected-quorum-votes=3 \ stonith-enabled=true \ no-quorum-policy=suicide \ stonith-action=poweroff If I disconnect one node from the network, crm_mon shows: Current DC: mgmt03 - partition WITHOUT quorum ... Node mgmt01: UNCLEAN (offline) Node mgmt02: UNCLEAN (offline) Online: [ mgmt03 ] Clone Set: fenc_clon Started: [ ipfuie-mgmt03 ] Stopped: [ suicide_res:0 suicide_res:1 ] ~~~ No action, neither reboot nor poweroff is taken. 1. What did I do wrong here? 2. OK, let's be more precise: I have the feeling, that the suicide ressource should be in a default state of stopped (on all nodes) and should only be started on the node, which has to fence itself. Am I right? And, if yes, how is that accomplished? 3. How does the no-quorum-policy relate to the stonith-ressources? I didn't find any documentation, if the two have any connection at all. 4. Am I correct, that the no-quorum-policy is what a node (or a cluster partition) should do to itself, when it looses quorum (for example, shut down itself), and stonith is what the nodes with quorum try to do to the nodes without? 5. Shouldn't then no-quorum-policy=suicide be obsolet in case of suicide as stonith-method? TNX for your help (again), Andreas CONET Solutions GmbH, Theodor-Heuss-Allee 19, 53773 Hennef. Registergericht/Registration Court: Amtsgericht Siegburg (HRB Nr. 9136) Geschäftsführer/Managing Directors: Jürgen Zender (Sprecher/Chairman), Anke Höfer Vorsitzender des Aufsichtsrates/Chairman of the Supervisory Board: Hans Jürgen Niemeier CONET Technologies AG, Theodor-Heuss-Allee 19, 53773 Hennef. Registergericht/Registration Court: Amtsgericht Siegburg (HRB Nr. 10328 ) Vorstand/Member of the Managementboard: Rüdiger Zeyen (Sprecher/Chairman), Wilfried Pütz Vorsitzender des Aufsichtsrates/Chairman of the Supervisory Board: Dr. Gerd Jakob ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
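For the later switch to sbd mentioned above, the device-based setup replaces the suicide clone entirely. A rough sketch with a placeholder disk path, assuming the external/sbd plugin from cluster-glue (the sbd daemon itself must also be configured and running on every node):

    primitive sbd-fencing stonith:external/sbd \
          params sbd_device="/dev/disk/by-id/scsi-SHARED_LUN-part1"
    property stonith-enabled="true"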
Re: [Linux-HA] CLVM cmirror using Pacemaker DLM integration on rhel 6
On Thu, Feb 17, 2011 at 3:34 PM, Pieter Baele pieter.ba...@gmail.com wrote: Hi, With our last cluster experiments we try to set up Pacemaker with CLVM mirroring on RHEL 6.0 I added a DLM resource, but when I try to add clvm in crm, I get the following error: crm(live)configure# primitive clvm ocf:lvm2:clvmd params daemon_timeout=30 op monitor interval=60 timeout=60 ERROR: ocf:lvm2:clvmd: could not parse meta-data: ERROR: ocf:lvm2:clvmd: no such resource agent Any idea what's missing? The ocf:lvm2:clvmd resource agent perhaps? Is there a short guide/howto somewhere how to set this up? I don't know of one personally Last updated: Thu Feb 17 15:32:12 2011 Stack: openais Current DC: xxx - partition with quorum Version: 1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe 2 Nodes configured, 2 expected votes 2 Resources configured. Online: [ xyz xyz ] ClusterIP (ocf::heartbeat:IPaddr2): Started x dlm (ocf::pacemaker:controld): Started x ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
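The quickest way to see whether the agent is installed at all is to ask the shell, and the filesystem, what the lvm2 provider ships. A short sketch:

    crm ra classes                      # lists the available classes and ocf providers
    crm ra list ocf lvm2                # clvmd should show up here if the agent is installed
    ls /usr/lib/ocf/resource.d/lvm2/    # the scripts themselves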
Re: [Linux-ha-dev] New ocft config file for IBM db2 resource agent
On Tue, Feb 15, 2011 at 10:50 AM, Dejan Muhamedagic deja...@fastmail.fm wrote: Hi Holger, On Tue, Feb 15, 2011 at 09:49:07AM +0100, Holger Teutsch wrote: Hi, please find enclosed an ocft config for db2 for review and inclusion into the project if appropriate. Wonderful! This is the first time somebody contributed an ocft testcase. Looks like lmb owes somebody lunch :-) The current 1.0.4 agent passes the tests 8-) . I've never doubted that either. Cheers, Dejan Regards Holger # db2 # # This test assumes a db2 ESE instance with two partions and a database. # Default is instance=db2inst1, database=ocft # adapt this in set_testenv below # # Simple steps to generate a test environment (if you don't have one): # # A virtual machine with 1200MB RAM is sufficient # # - download an eval version of DB2 server from IBM # - create an user db2inst1 in group db2inst1 # # As root # - install DB2 software in some location # - create instance # cd this_location/instance # ./db2icrt -s ese -u db2inst1 db2inst1 # - adapt profile of db2inst1 as instructed by db2icrt # # As db2inst1 # # allow to run with small memory footprint # db2set DB2_FCM_SETTINGS=FCM_MAXIMIZE_SET_SIZE:FALSE # db2start # db2start dbpartitionnum 1 add dbpartitionnum hostname $(uname -n) port 1 without tablespaces # db2stop # db2start # db2 create database ocft # Done # In order to install a real cluster refer to http://www.linux-ha.org/wiki/db2_(resource_agent) CONFIG HangTimeout 40 SETUP-AGENT # nothing CASE-BLOCK set_testenv Var OCFT_instance=db2inst1 Var OCFT_db=ocft CASE-BLOCK crm_setting Var OCF_RESKEY_instance=$OCFT_instance Var OCF_RESKEY_CRM_meta_timeout=3 CASE-BLOCK default_status AgentRun stop CASE-BLOCK prepare Include set_testenv Include crm_setting Include default_status CASE check base env Include prepare AgentRun start OCF_SUCCESS CASE check base env: invalid 'OCF_RESKEY_instance' Include prepare Var OCF_RESKEY_instance=no_such AgentRun start OCF_ERR_INSTALLED CASE invalid instance config Include prepare Bash eval mv ~$OCFT_instance/sqllib ~$OCFT_instance/sqllib- BashAtExit eval mv ~$OCFT_instance/sqllib- ~$OCFT_instance/sqllib AgentRun start OCF_ERR_INSTALLED CASE unimplemented command Include prepare AgentRun no_cmd OCF_ERR_UNIMPLEMENTED CASE normal start Include prepare AgentRun start OCF_SUCCESS CASE normal stop Include prepare AgentRun start AgentRun stop OCF_SUCCESS CASE double start Include prepare AgentRun start AgentRun start OCF_SUCCESS CASE double stop Include prepare AgentRun stop OCF_SUCCESS CASE started: monitor Include prepare AgentRun start AgentRun monitor OCF_SUCCESS CASE not started: monitor Include prepare AgentRun monitor OCF_NOT_RUNNING CASE killed instance: monitor Include prepare AgentRun start OCF_SUCCESS AgentRun monitor OCF_SUCCESS BashAtExit rm /tmp/ocft-helper1 Bash echo su $OCFT_instance -c '. ~$OCFT_instance/sqllib/db2profile; db2nkill 0 /dev/null 21' /tmp/ocft-helper1 Bash sh -x /tmp/ocft-helper1 AgentRun monitor OCF_NOT_RUNNING CASE overload param instance by admin Include prepare Var OCF_RESKEY_instance=no_such Var OCF_RESKEY_admin=$OCFT_instance AgentRun start OCF_SUCCESS CASE check start really activates db Include prepare AgentRun start OCF_SUCCESS BashAtExit rm /tmp/ocft-helper2 Bash echo su $OCFT_instance -c '. 
~$OCFT_instance/sqllib/db2profile; db2 get snapshot for database on $OCFT_db >/dev/null' > /tmp/ocft-helper2 Bash sh -x /tmp/ocft-helper2 CASE multipartition test Include prepare AgentRun start OCF_SUCCESS AgentRun monitor OCF_SUCCESS # start does not start partition 1 Var OCF_RESKEY_dbpartitionnum=1 AgentRun monitor OCF_NOT_RUNNING # now start 1 AgentRun start OCF_SUCCESS AgentRun monitor OCF_SUCCESS # now stop 1 AgentRun stop OCF_SUCCESS AgentRun monitor OCF_NOT_RUNNING # does not affect 0 Var OCF_RESKEY_dbpartitionnum=0 AgentRun monitor OCF_SUCCESS # fault injection does not work on the 1.0.4 client due to a hardcoded path CASE simulate hanging db2stop (not meaningful for 1.0.4 agent) Include prepare AgentRun start OCF_SUCCESS Bash [ ! -f /usr/local/bin/db2stop ] BashAtExit rm /usr/local/bin/db2stop Bash echo -e #!/bin/sh\necho fake db2stop\nsleep 1 > /usr/local/bin/db2stop Bash chmod +x /usr/local/bin/db2stop AgentRun stop OCF_SUCCESS #
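[Editor's note] For anyone wanting to try the config: the ocft tool in the resource-agents tree compiles such files into test scripts and runs them against the installed agent. The invocation below is from memory and may differ between resource-agents versions; treat it as a sketch, not a reference:

    # assuming the config above is saved as 'db2' in ocft's configs directory
    ocft make db2     # generate the test cases from the config
    ocft test db2     # run the generated cases against the installed db2 RA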
Re: [Linux-ha-dev] [PATCH] manage PostgreSQL 9.0 streaming replication using Master/Slave
On Mon, Feb 14, 2011 at 8:46 PM, Serge Dubrouski serge...@gmail.com wrote: On Mon, Feb 14, 2011 at 1:28 AM, Takatoshi MATSUO matsuo@gmail.com wrote: Ideally the demote operation should stop a master node and then restart it in hot-standby mode. It's up to the administrator to make sure that no node with outdated data gets promoted to the master role. One should follow standard procedures: cluster software shouldn't be configured for autostart at boot time, and the administrator has to make sure that data was refreshed if the node was down for some prolonged time. Hmm.. Do you mean that the RA puts recovery.conf in place automatically at the demote op to start hot standby? Please give me some time to think it over. Sorry, I got the wrong idea about restoring data. Starting as hot-standby needs a restore every time, because the time-line ID of PostgreSQL is incremented. In addition, shutting down PostgreSQL with the immediate option causes inconsistent WAL between primary and hot-standby. So I think it's difficult to start the slave automatically at demote. Still, do you think it's better to implement restoring? I'm afraid it's not just better, but it's a must. We have to play by Pacemaker's rules and that means that we have to properly implement the demote operation, and that's switching from Master to Slave, not just stopping the Master. I do appreciate your efforts, but the implementation has to conform to Pacemaker standards, i.e. the Master has to start where it's configured in Pacemaker, not just where a recovery.conf file exists. That's the ideal at least. Most of the time it should be possible to self-promote and let Pacemaker figure out the result. But I can easily imagine there would also be situations where this is going to blow up in your face. The administrator has to be able to easily switch between node roles and so on. I still need some more time to learn PostgreSQL data replication and do some tests. Let's think about whether it's possible to implement a real Master/Slave in the Pacemaker sense of things. ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/ -- Serge Dubrouski. ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/ ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
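[Editor's note] For context, the Pacemaker side of such an agent is an ordinary master/slave (ms) resource; the point being argued above is only what the agent itself does on promote/demote. A purely illustrative wrapper around a replication-capable pgsql agent follows; parameter choices and intervals are placeholders, not taken from the patch under discussion:

    primitive pgsql ocf:heartbeat:pgsql \
            op monitor interval=30 role=Slave timeout=60 \
            op monitor interval=29 role=Master timeout=60
    ms ms-pgsql pgsql \
            meta master-max=1 master-node-max=1 clone-max=2 notify=true

Note that the two monitor operations must use different intervals so Pacemaker can distinguish the role-specific checks.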
Re: [Linux-HA] OCF_RESKEY_CRM_meta_timeout not matching monitor timeout meta-data
On Fri, Feb 4, 2011 at 11:23 AM, Brett Delle Grazie brett.dellegra...@gmail.com wrote: Hi, Apologies for cross-posting but I'm not sure where this problem resides. I'm running: corosync-1.2.7-1.1.el5.x86_64 corosynclib-1.2.7-1.1.el5.x86_64 cluster-glue-1.0.6-1.6.el5.x86_64 cluster-glue-libs-1.0.6-1.6.el5.x86_64 pacemaker-1.0.10-1.4.el5.x86_64 pacemaker-libs-1.0.10-1.4.el5.x86_64 resource-agents-1.0.3-2.6.el5.x86_64 on RHEL5. In one of my resource agents (tomcat) I'm directly outputting the result of: $((OCF_RESKEY_CRM_meta_timeout/1000)) to an external file. and its coming up with a value of '100' Whereas the resource definition in pacemaker specifies timeout of '30' specifically: primitive tomcat_tc1 ocf:intact:tomcat \ params tomcat_user=tomcat catalina_home=/opt/tomcat6 catalina_pid=/home/tomcat/tc1/temp/tomcat.pid catalina_rotate_log=NO script_log=/home/tomcat/tc1/logs/tc1.log statusurl=http://127.0.0.1/version/; java_home=/usr/lib/jvm/java \ op start interval=0 timeout=70 \ op stop interval=0 timeout=20 \ op monitor interval=60 timeout=30 start-delay=70 Is this a known bug? No. Could you file a bug please? Does it affect all operation timeouts? Unknown Thanks, -- Best Regards, Brett Delle Grazie ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
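[Editor's note] As background, OCF_RESKEY_CRM_meta_timeout is handed to the agent in milliseconds, so dividing by 1000 as the poster does is the right conversion; the open question in this thread is only why the value does not match the configured 30. A defensive agent typically guards the variable before using it. A minimal sketch, not code from the tomcat RA in question:

    # fall back to a sane default if the CRM did not pass a timeout
    : ${OCF_RESKEY_CRM_meta_timeout=20000}
    timeout_s=$(( OCF_RESKEY_CRM_meta_timeout / 1000 ))   # milliseconds -> seconds
    ocf_log debug "effective operation timeout: ${timeout_s}s"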
Re: [Linux-HA] A bunch of thoughts/questions about heartbeat network(s)
On Tue, Jan 25, 2011 at 8:15 AM, Alain.Moulle alain.mou...@bull.net wrote: Hi, A bunch of thoughts/questions about heartbeat network(s): In the following, when I talk about two heartbeat networks, I'm talking about two physically different networks set in the corosync.conf as two different ring numbers (with rrp_mode set to active). 1/ With a 2-node HA cluster, it is recommended to have two heartbeat networks to avoid the race for fencing, or even dual-fencing, in case of a problem on this heartbeat network. But with a more-than-2-node HA cluster, is it always worthwhile to have two heartbeat networks? My understanding is that if one node can't be contacted by the other nodes in the cluster due to a heartbeat network problem, then, as it is isolated, it does not have quorum and so is not authorized to fence any other node, whereas the other nodes have quorum and so will decide to fence the node with the problem. Right? Right, but wouldn't it be better to have no need to shoot anyone? So is there any other advantage to having more than 2 heartbeat networks in a more-than-2-node HA cluster? 2/ If the future of the HA stack for Pacemaker is option 4 (corosync + cpg + cman + mcp), Option 4 does not involve cman meaning that cluster manager configuration parameters will all be in cluster.conf and nothing more in corosync.conf (again, that's my understanding...), Other way around, cluster.conf is going away (like cman) not corosync.conf and from memory there isn't any possibility to set two heartbeat networks in cluster.conf (Cluster Suite from RH was working only on 1 heartbeat network, and if one wanted to work on 2 heartbeat networks he had to configure a bonding solution). Am I right when I write that there is no possibility of 2 hb networks with stack option 4? No Thanks a lot for your responses, and tell me if some of my understanding is not right ... Alain ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
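[Editor's note] For completeness, this is roughly what two rings look like in corosync.conf; the addresses and ports are examples only:

    totem {
            version: 2
            rrp_mode: active
            interface {
                    ringnumber: 0
                    bindnetaddr: 192.168.1.0
                    mcastaddr: 226.94.1.1
                    mcastport: 5405
            }
            interface {
                    ringnumber: 1
                    bindnetaddr: 10.0.0.0
                    mcastaddr: 226.94.1.2
                    mcastport: 5405
            }
    }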
Re: [Linux-HA] One-Node-Cluster
On Mon, Feb 14, 2011 at 12:40 PM, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: Andrew Beekhof and...@beekhof.net wrote on 14.02.2011 at 10:08 in message aanlktinuc9_oqpwjubxrdmqkncqvnqx68a_1kbqss...@mail.gmail.com: [...] The log just keeps on saying: Feb 8 16:01:03 dmcs2 pengine: [1480]: WARN: cluster_status: We do not have quorum - fencing and resource management disabled Exactly. Read that line again a couple of times, then read Clusters from Scratch. [...] Which makes me wonder: Can a one-node cluster ever have quorum? Not really, which is why we have no-quorum-policy. I think a one-node cluster is a completely valid construct. Also with Linux-HA? Yep. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
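[Editor's note] The practical takeaway for a deliberate one-node (or two-node) cluster is to tell Pacemaker not to stop managing resources when quorum is lost. A minimal sketch, assuming the usual defaults:

    crm configure property no-quorum-policy=ignore
    # with nothing to fence in a single-node setup, stonith is usually disabled too
    crm configure property stonith-enabled=false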
Re: [Linux-HA] One-Node-Cluster
On Tue, Feb 15, 2011 at 6:08 AM, Alan Robertson al...@unix.sh wrote: On 02/14/2011 04:45 AM, Andrew Beekhof wrote: On Mon, Feb 14, 2011 at 12:40 PM, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: Andrew Beekhof and...@beekhof.net wrote on 14.02.2011 at 10:08 in message aanlktinuc9_oqpwjubxrdmqkncqvnqx68a_1kbqss...@mail.gmail.com: [...] The log just keeps on saying: Feb 8 16:01:03 dmcs2 pengine: [1480]: WARN: cluster_status: We do not have quorum - fencing and resource management disabled Exactly. Read that line again a couple of times, then read Clusters from Scratch. [...] Which makes me wonder: Can a one-node cluster ever have quorum? Not really, which is why we have no-quorum-policy. I think a one-node cluster is a completely valid construct. Also with Linux-HA? Yep. If you're using the Heartbeat membership stack, then it is perfectly happy to give you quorum in a one-node cluster. Or a two-node cluster. Which is not exactly ideal. In fact, at one time I wrote a script to create a cluster configuration from your /etc/init.d/ scripts - so that Pacemaker could be effectively a nice replacement for init - with a respawn that really works ;-) -- Alan Robertson al...@unix.sh Openness is the foundation and preservative of friendship... Let me claim from you at all times your undisguised opinions. - William Wilberforce ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-ha-dev] ocft: status vs. monitor
On Sun, Feb 13, 2011 at 11:01 AM, Holger Teutsch holger.teut...@web.de wrote: Hi, to my knowledge OCF *requires* a method monitor while status is optional (or what is it really for? heritage, compatibility, ...) Shouldn't the ocft configs check for status ? Yes, unless its trying to talk to an LSB resource. -holger diff -r 722c8a7a03e9 tools/ocft/apache --- a/tools/ocft/apache Fri Feb 11 18:49:09 2011 +0100 +++ b/tools/ocft/apache Sun Feb 13 10:57:50 2011 +0100 @@ -52,14 +52,14 @@ Include prepare AgentRun stop OCF_SUCCESS -CASE running status +CASE running monitor Include prepare AgentRun start - AgentRun status OCF_SUCCESS + AgentRun monitor OCF_SUCCESS -CASE not running status +CASE not running monitor Include prepare - AgentRun status OCF_NOT_RUNNING + AgentRun monitor OCF_NOT_RUNNING CASE unimplemented command Include prepare diff -r 722c8a7a03e9 tools/ocft/mysql --- a/tools/ocft/mysql Fri Feb 11 18:49:09 2011 +0100 +++ b/tools/ocft/mysql Sun Feb 13 10:57:50 2011 +0100 @@ -46,14 +46,14 @@ Include prepare AgentRun stop OCF_SUCCESS -CASE running status +CASE running monitor Include prepare AgentRun start - AgentRun status OCF_SUCCESS + AgentRun monitor OCF_SUCCESS -CASE not running status +CASE not running monitor Include prepare - AgentRun status OCF_NOT_RUNNING + AgentRun monitor OCF_NOT_RUNNING CASE check lib file Include prepare ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/ ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-HA] Documenting ocf:pacemaker:ping
On Thu, Feb 10, 2011 at 9:14 AM, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: Hi! I'm getting into Linux-HA, and it seems the documentation was made with a very hot needle. For example, ocf:pacemaker:ping has the following documentation (crm(live)ra# info ocf:ping in SLES11 SP3+Updates): crm(live)ra# info ocf:ping PID file dampen (integer, [5s]): Dampening interval The time to wait (dampening) further changes occur The sentence above is grammatically wrong. At least its spelt correctly, which is more than people usually get from me. name (string, [pingd]): Attribute name The name of the attributes to set. This is the name to be used in the constraints. multiplier (integer): Value multiplier The number by which to multiply the number of connected ping nodes by Please explain the reason (semantics) for the multiplication! Please read Pacemaker Explained host_list* (string): Host list The list of ping nodes to count. to count, or to ping, or both? My, we are pedantic today. To ping and count towards the number of active hosts attempts (integer, [2]): no. of ping attempts Number of ping attempts, per host, before declaring it dead timeout (integer, [2]): ping timeout in seconds How long, in seconds, to wait before declaring a ping lost a ping lost == a host dead? Yes Or is it the the monitor reports a failure? No options (string): Extra Options A catch all for any other options that need to be passed to ping. What about Additional options to be passed to ping? This text is too short, this text is too long... are you ever happy? debug (string, [false]): Verbose logging Enables to use default attrd_updater verbose logging on every call. What about `true' enables verbose logging? Operations' defaults (advisory minimum): start timeout=60 stop timeout=20 reload timeout=100 monitor_0 interval=10 timeout=60 Is it really monitor_0? What is that _0? I'm guessing a typo Regards, Ulrich ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Documenting ocf:pacemaker:ping
On Thu, Feb 10, 2011 at 12:28 PM, Dejan Muhamedagic deja...@fastmail.fm wrote: On Thu, Feb 10, 2011 at 09:29:41AM +0100, Andrew Beekhof wrote: On Thu, Feb 10, 2011 at 9:14 AM, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: Hi! I'm getting into Linux-HA, and it seems the documentation was made with a very hot needle. For example, ocf:pacemaker:ping has the following documentation (crm(live)ra# info ocf:ping in SLES11 SP3+Updates): crm(live)ra# info ocf:ping PID file dampen (integer, [5s]): Dampening interval The time to wait (dampening) further changes occur The sentence above is grammatically wrong. At least its spelt correctly, which is more than people usually get from me. name (string, [pingd]): Attribute name The name of the attributes to set. This is the name to be used in the constraints. multiplier (integer): Value multiplier The number by which to multiply the number of connected ping nodes by Please explain the reason (semantics) for the multiplication! Please read Pacemaker Explained host_list* (string): Host list The list of ping nodes to count. to count, or to ping, or both? My, we are pedantic today. To ping and count towards the number of active hosts attempts (integer, [2]): no. of ping attempts Number of ping attempts, per host, before declaring it dead timeout (integer, [2]): ping timeout in seconds How long, in seconds, to wait before declaring a ping lost a ping lost == a host dead? Yes Or is it the the monitor reports a failure? No options (string): Extra Options A catch all for any other options that need to be passed to ping. What about Additional options to be passed to ping? This text is too short, this text is too long... are you ever happy? debug (string, [false]): Verbose logging Enables to use default attrd_updater verbose logging on every call. What about `true' enables verbose logging? Operations' defaults (advisory minimum): start timeout=60 stop timeout=20 reload timeout=100 monitor_0 interval=10 timeout=60 Is it really monitor_0? What is that _0? I'm guessing a typo Not a typo, but monitor with depth check 0. Well, perhaps appending depth could be skipped in this case. Oh, I mistook it for a non-recurring monitor op ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
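[Editor's note] Putting the metadata discussed above into practice: a typical configuration clones the ping resource and then uses the resulting node attribute (pingd by default) in a location rule; multiplier only matters if you score on the attribute value rather than just testing it. The names, addresses and scores below are examples, not taken from the thread:

    primitive p-ping ocf:pacemaker:ping \
            params host_list="192.168.1.1 192.168.1.254" multiplier=1000 dampen=5s \
            op monitor interval=10 timeout=60
    clone c-ping p-ping
    # keep some-resource off nodes that cannot reach any of the ping targets
    location loc-connected some-resource \
            rule -inf: not_defined pingd or pingd lte 0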
Re: [Linux-ha-dev] New master/slave resource agent for DB2 databases in HADR (High Availability Disaster Recovery) mode
On Wed, Feb 9, 2011 at 12:15 PM, Dejan Muhamedagic deja...@fastmail.fm wrote: On Wed, Feb 09, 2011 at 12:06:04PM +0100, Florian Haas wrote: On 2011-02-09 11:56, Dejan Muhamedagic wrote: It is plugin compatible to the old version of the agent. Great! Unfortunately, we can't replace the old db2 now, the number of changes is very large: db2 | 1076 +++- 1 file changed, 687 insertions(+), 389 deletions(-) And the code is completely new (though I have no doubt that it is of excellent quality). So, I'd suggest to add this as another db2 RA. Once it gets some field testing we can mark the old one as deprecated. What name would you suggest? db2db2? Just making sure: Is that a joke? A bit of a joke, yes. But the alternatives such as db22 or db2new looked a bit boring. I think boring is the least of our problems with those names. Are you going to change the name of every agent that gets a rewrite? IPaddr2-ng-ng-again-and-one-more-plus-one Solicit feedback, like was done for kliend's new agent, and replace the existing one it if/when people respond positively. Its not like the old one disappears from the face of the earth after you merge the new one. wget -o /usr/lib/ocf/resource.d/heartbeat/db2 http://hg.linux-ha.org/agents/file/agents-1.0.3/heartbeat/db2 HADR is a very different beast from non-HADR db, right? Why not then add the hadr boolean parameter and use that instead of checking if the resource has been configured as multi-state? I'll take responsibility for suggesting the use of ocf_is_ms(), and I'd be curious to find out what you think is wrong with that approach. There's nothing wrong in the sense whether it is going to work. But someday, db2 may sport say HADR2 or VHA or whatever else which may run as a ms resource. I just think that it's better to make it obvious in the configuration that the user runs HADR. Does that make sense? Because if anything is, then the mysql RA needs fixing too. No idea what's up with mysql. Cheers, Dejan Florian ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/ ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/ ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] New master/slave resource agent for DB2 databases in HADR (High Availability Disaster Recovery) mode
On Wed, Feb 9, 2011 at 2:17 PM, Dejan Muhamedagic deja...@fastmail.fm wrote: Hi Andrew, On Wed, Feb 09, 2011 at 01:33:03PM +0100, Andrew Beekhof wrote: On Wed, Feb 9, 2011 at 12:15 PM, Dejan Muhamedagic deja...@fastmail.fm wrote: On Wed, Feb 09, 2011 at 12:06:04PM +0100, Florian Haas wrote: On 2011-02-09 11:56, Dejan Muhamedagic wrote: It is plugin compatible to the old version of the agent. Great! Unfortunately, we can't replace the old db2 now, the number of changes is very large: db2 | 1076 +++- 1 file changed, 687 insertions(+), 389 deletions(-) And the code is completely new (though I have no doubt that it is of excellent quality). So, I'd suggest to add this as another db2 RA. Once it gets some field testing we can mark the old one as deprecated. What name would you suggest? db2db2? Just making sure: Is that a joke? A bit of a joke, yes. But the alternatives such as db22 or db2new looked a bit boring. I think boring is the least of our problems with those names. Are you going to change the name of every agent that gets a rewrite? IPaddr2-ng-ng-again-and-one-more-plus-one I don't think it is going to happen that often. It happens often enough - its just normally by a core developer. And realistically, almost every RA is going to get similar treatment (over time) as they're merged with the Red Hat ones. Solicit feedback, like was done for kliend's new agent, and replace the existing one it if/when people respond positively. That would be for the best, but it takes time. We may opt for it, but I wanted to add the this agent to the new release. Understood - but I think the long-term pain that is created outweighs any perceived benefit in the short-term. Also, it is very seldom that people test anything which is not contained in the release. Unless there's no alternative as was the case with conntrac. Its not like the old one disappears from the face of the earth after you merge the new one. wget -o /usr/lib/ocf/resource.d/heartbeat/db2 http://hg.linux-ha.org/agents/file/agents-1.0.3/heartbeat/db2 What do you suggest? That we add to the release announcement: The db2 RA has been rewritten and didn't get yet a lot of field testing. Please help test it. So don't do that :-) Put up a wiki page with instructions for how to download+use the new agent and give feedback. If the new version is significantly better, you're going to hear people pleading for its inclusion pretty soon. But, if you want to keep the old agent, download the old one from the repository and use it instead of the new one. And don't forget to do the same when installing the next resource-agents release. At any rate, I wouldn't want to take responsibility for replacing the existing (and working RA) with a completely new and not yet tested code. Call me coward :) I wouldn't either - which is why I keep saying test then replace :-) Another alternative, create a testing provider... not sure if its a good idea or not, just putting it out there. Finally, I expected that the new functionality is going to be added without much changes to the existing code. But it turned out to be a rewrite. Cheers, Dejan HADR is a very different beast from non-HADR db, right? Why not then add the hadr boolean parameter and use that instead of checking if the resource has been configured as multi-state? I'll take responsibility for suggesting the use of ocf_is_ms(), and I'd be curious to find out what you think is wrong with that approach. There's nothing wrong in the sense whether it is going to work. 
But someday, db2 may sport say HADR2 or VHA or whatever else which may run as a ms resource. I just think that it's better to make it obvious in the configuration that the user runs HADR. Does that make sense? Because if anything is, then the mysql RA needs fixing too. No idea what's up with mysql. Cheers, Dejan Florian ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/ ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/ ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/ ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/ ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http
Re: [Linux-ha-dev] New master/slave resource agent for DB2 databases in HADR (High Availability Disaster Recovery) mode
On Wed, Feb 9, 2011 at 3:35 PM, Lars Ellenberg lars.ellenb...@linbit.com wrote: On Wed, Feb 09, 2011 at 02:43:17PM +0100, Andrew Beekhof wrote: Are you going to change the name of every agent that gets a rewrite? IPaddr2-ng-ng-again-and-one-more-plus-one I don't think it is going to happen that often. It happens often enough - its just normally by a core developer. And realistically, almost every RA is going to get similar treatment (over time) as they're merged with the Red Hat ones. Solicit feedback, like was done for kliend's new agent, and replace the existing one it if/when people respond positively. Its not like the old one disappears from the face of the earth after you merge the new one. wget -o /usr/lib/ocf/resource.d/heartbeat/db2 http://hg.linux-ha.org/agents/file/agents-1.0.3/heartbeat/db2 What do you suggest? That we add to the release announcement: The db2 RA has been rewritten and didn't get yet a lot of field testing. Please help test it. So don't do that :-) Put up a wiki page with instructions for how to download+use the new agent and give feedback. How about a staging area? /usr/lib/ocf/resource.d/staging/ I was thinking along the same lines when I said testing. Either name works for me :-) we can also add a /usr/lib/ocf/resource.d/deprecated/ The thing in .../heartbeat/ can become a symlink, and be given config file status by the package manager? Something like that. So we have it bundled with the release, it is readily available without much go to that web page and download and save to there and make executable and then blah. It would simply pop up in crm shell and DRBD-MC and so on. We can add please give feedback to the description, and this will replace the current RA with release + 2 unless we get veto-ing feedback to the release notes. Once settled, we copy over the staging one to the real directory, replacing the original one, and add a please fix your config to the thing that remains in staging/, so we will be able to start a further rewrite with the next merge window. * does not break existing setups * new RAs and rewrites are readily available -- : Lars Ellenberg : LINBIT | Your Way to High Availability : DRBD/HA support and consulting http://www.linbit.com DRBD® and LINBIT® are registered trademarks of LINBIT, Austria. ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/ ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-HA] how to configure heartbeat for polling MySql server?
On Wed, Feb 2, 2011 at 5:28 PM, Danilo danilo.abbasci...@gmail.com wrote: On 02/01/2011 09:32 AM, Cristian Mammoli - Apra Sistemi wrote: On 01/31/2011 10:08 AM, Danilo Abbasciano wrote: If the running node is rebooted the cluster works and the service will be switched to the other node. But if I stop MySQL, the cluster doesn't switch or try to restart it. How do I configure the cluster to check whether the MySQL service is alive? Use the mysql ocf resource agent, not the lsb init script. Then configure monitoring operations. Hi! Thanks for your help. I found good documentation here http://www.linux-ha.org/wiki/Mysql_%28resource_agent%29 which sounds good. But I have another problem. I have an old version of heartbeat (heartbeat-2.1.4-4.1), so I don't have the crm management program to configure primitives. But I have crm_sh; could I use it instead of crm? And how? The cluster is on a critical system and I would prefer not to update it. If it really is a critical system, then, as the author of the crm, I beg you to update it. Thanks in advance. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
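[Editor's note] For completeness, the suggested setup with the OCF agent and a monitor operation looks roughly like this. Paths, names and timeouts are examples, and the crm shell syntax below requires Pacemaker rather than the old heartbeat-2.1.4 CRM:

    primitive p-mysql ocf:heartbeat:mysql \
            params binary=/usr/bin/mysqld_safe config=/etc/my.cnf \
                   datadir=/var/lib/mysql pid=/var/run/mysqld/mysqld.pid \
            op start timeout=120 \
            op stop timeout=120 \
            op monitor interval=30 timeout=30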
Re: [Linux-HA] resource stickiness question
On Wed, Feb 9, 2011 at 9:05 AM, Erik Dobák erik.do...@gmail.com wrote: i have 2 nodes which are running 2 resource groups in an active/passive cluster. my goal was to run 1 resource active on lc-cl1 and the other resource on node lc-cl2. this is how i configured it: group bamcluster ipaddr2 lcbam \ meta target-role=Started is-managed=true group bamclusteruat ipaddr2uat lcbamuat location primarylc bamcluster 100: lc-cl1 location primarylcuat bamclusteruat 100: lc-cl2 location secondarylc bamcluster 50: lc-cl2 location secondarylcuat bamclusteruat 50: lc-cl1 rsc_defaults $id=rsc-options \ resource-stickiness=75 but after adding the second resource (bamclusteruat) it did also try to start at lc-cl1 of course, both resources prefer that host the most. only if primarylcuat is down will they run elsewhere. what did i mess up? check up on colocation constraints (with a negative score) E ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
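[Editor's note] Concretely, the suggestion above translates to something like the following, reusing the group names from the thread (the constraint name is made up):

    colocation keep-apart -inf: bamclusteruat bamcluster

With -inf: the two groups can never share a node, which also means one of them stays stopped if only one node is left; a finite negative score (for example -1000:) merely makes them prefer to stay apart while still allowing both to run on the surviving node.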