Re: [Pacemaker] Patch for bugzilla 2541: Shell should warn if parameter uniqueness is violated
Oops, this is actually a bug in fence_ipmilan, which reports all of its parameters as unique.

26.03.2011 08:28, Vladislav Bogdanov wrote:
> Hi,
>
> it seems like it was commit d0472a26eda1 which now causes the following:
>
> WARNING: Resources stonith-v02-a,stonith-v02-b,stonith-v02-c,stonith-v02-d violate uniqueness for parameter "action": "reboot"
> WARNING: Resources stonith-v02-a,stonith-v02-b,stonith-v02-c,stonith-v02-d violate uniqueness for parameter "auth": "md5"
> WARNING: Resources stonith-v02-a,stonith-v02-b,stonith-v02-c,stonith-v02-d violate uniqueness for parameter "lanplus": "true"
> WARNING: Resources stonith-v02-a,stonith-v02-b,stonith-v02-c,stonith-v02-d violate uniqueness for parameter "method": "onoff"
> WARNING: Resources stonith-v02-a,stonith-v02-b,stonith-v02-c,stonith-v02-d violate uniqueness for parameter "passwd": ""
> WARNING: Resources stonith-v02-a,stonith-v02-b,stonith-v02-c,stonith-v02-d violate uniqueness for parameter "login": ""
>
> Those resources are fence_ipmilan.
>
> Best,
> Vladislav

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
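For context, the shell decides which parameters to check from the agent's metadata: each parameter carries a "unique" attribute. The fragment below is illustrative only (hypothetical, not the actual fence_ipmilan metadata output); it shows the distinction the agent should be making:

```xml
<!-- Illustrative metadata fragment (hypothetical, not fence_ipmilan's real
     output). Shared options such as "action" should be unique="0"; only
     parameters that genuinely identify a single device, such as "ipaddr",
     should be unique="1". -->
<parameters>
  <parameter name="action" unique="0">
    <content type="string" default="reboot"/>
    <shortdesc lang="en">Fencing action to perform</shortdesc>
  </parameter>
  <parameter name="ipaddr" unique="1">
    <content type="string"/>
    <shortdesc lang="en">IP address of the IPMI device</shortdesc>
  </parameter>
</parameters>
```

An agent that emits unique="1" for every parameter makes the shell warn whenever two resources share any value, which is exactly what the warnings above show.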
Re: [Pacemaker] Patch for bugzilla 2541: Shell should warn if parameter uniqueness is violated
Hi,

it seems like it was commit d0472a26eda1 which now causes the following:

WARNING: Resources stonith-v02-a,stonith-v02-b,stonith-v02-c,stonith-v02-d violate uniqueness for parameter "action": "reboot"
WARNING: Resources stonith-v02-a,stonith-v02-b,stonith-v02-c,stonith-v02-d violate uniqueness for parameter "auth": "md5"
WARNING: Resources stonith-v02-a,stonith-v02-b,stonith-v02-c,stonith-v02-d violate uniqueness for parameter "lanplus": "true"
WARNING: Resources stonith-v02-a,stonith-v02-b,stonith-v02-c,stonith-v02-d violate uniqueness for parameter "method": "onoff"
WARNING: Resources stonith-v02-a,stonith-v02-b,stonith-v02-c,stonith-v02-d violate uniqueness for parameter "passwd": ""
WARNING: Resources stonith-v02-a,stonith-v02-b,stonith-v02-c,stonith-v02-d violate uniqueness for parameter "login": ""

Those resources are fence_ipmilan.

Best,
Vladislav
Re: [Pacemaker] WARN: msg_to_op(1324): failed to get the value of field lrm_opstatus from a ha_msg
A few more thoughts that occurred after I hit send:

1. This problem seems to occur only when "/etc/init.d/heartbeat start" is executed on two nodes at the same time. If I only do one at a time it does not seem to occur. (This may be related to the creation of master/slave resources in /etc/ha.d/resource.d/startstop when heartbeat starts.)
2. This problem seemed to occur most frequently when I went from 4 master/slave resources to 6 master/slave resources.

Thanks,

Bob

----- Original Message -----
From: Bob Schatz
To: The Pacemaker cluster resource manager
Sent: Fri, March 25, 2011 4:22:39 PM
Subject: Re: [Pacemaker] WARN: msg_to_op(1324): failed to get the value of field lrm_opstatus from a ha_msg

After reading more threads, I noticed that I needed to include the PE outputs. Therefore, I have rerun the tests and included the PE outputs, the configuration file and the logs for both nodes. The test was rerun with max-children of 20.

Thanks,

Bob

----- Original Message -----
From: Bob Schatz
To: pacemaker@oss.clusterlabs.org
Sent: Thu, March 24, 2011 7:35:54 PM
Subject: [Pacemaker] WARN: msg_to_op(1324): failed to get the value of field lrm_opstatus from a ha_msg

I am getting these messages in the log:

2011-03-24 18:53:12| warning |crmd: [27913]: WARN: msg_to_op(1324): failed to get the value of field lrm_opstatus from a ha_msg
2011-03-24 18:53:12| info |crmd: [27913]: info: msg_to_op: Message follows:
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG: Dumping message with 16 fields
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[0] : [lrm_t=op]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[1] : [lrm_rid=SSJE02A2:0]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[2] : [lrm_op=start]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[3] : [lrm_timeout=30]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[4] : [lrm_interval=0]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[5] : [lrm_delay=0]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[6] : [lrm_copyparams=1]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[7] : [lrm_t_run=0]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[8] : [lrm_t_rcchange=0]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[9] : [lrm_exec_time=0]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[10] : [lrm_queue_time=0]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[11] : [lrm_targetrc=-1]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[12] : [lrm_app=crmd]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[13] : [lrm_userdata=91:3:0:dc9ad1c7-1d74-4418-a002-34426b34b576]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[14] : [(2)lrm_param=0x64c230(938 1098)]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG: Dumping message with 27 fields
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[0] : [CRM_meta_clone=0]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[1] : [CRM_meta_notify_slave_resource= ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[2] : [CRM_meta_notify_active_resource= ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[3] : [CRM_meta_notify_demote_uname= ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[4] : [CRM_meta_notify_inactive_resource=SSJE02A2:0 SSJE02A2:1 ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[5] : [ssconf=/var/omneon/config/config.JE02A2]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[6] : [CRM_meta_master_node_max=1]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[7] : [CRM_meta_notify_stop_resource= ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[8] : [CRM_meta_notify_master_resource= ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[9] : [CRM_meta_clone_node_max=1]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[10] : [CRM_meta_clone_max=2]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[11] : [CRM_meta_notify=true]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[12] : [CRM_meta_notify_start_resource=SSJE02A2:0 SSJE02A2:1 ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[13] : [CRM_meta_notify_stop_uname= ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[14] : [crm_feature_set=3.0.1]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[15] : [CRM_meta_notify_master_uname= ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[16] : [CRM_meta_master_max=1]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[17] : [CRM_meta_globally_unique=false]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[18] : [CRM_meta_notify_promote_resource=SSJE02A2:0 ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[19] : [CRM_meta_notify_promote_uname=mgraid-se02a1-0 ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[20] : [CRM_meta_notify_active_uname= ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[21] : [CRM_meta_notify_start_uname=mgrai
Re: [Pacemaker] DRBD and pacemaker interaction
On Fri, Mar 25, 2011 at 06:39:10PM +0100, Christoph Bartoschek wrote:
> Hi,
>
> I've already sent this mail to linux-ha but that list seems to be dead:

What makes you think so? That you did not get a reply within 40 minutes? You make me feel sorry about having replied there. Maybe you should consider signing a contract with defined SLAs ;-)

> we experiment with DRBD and pacemaker and see several times that the
> DRBD part is degraded (one node is outdated or diskless or something
> similar) but crm_mon just reports that the DRBD resource runs as master
> and slave on the nodes.
>
> There is no indication that the resource is not in its optimal mode of
> operation.
>
> For me it seems as if pacemaker knows only the states: running, stopped,
> failed.
>
> I am missing the state: running degraded or suboptimal.
>
> Is it already there and have I made a configuration error? Or what is
> the recommended way to check the sanity of the resources controlled by
> pacemaker?

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
Re: [Pacemaker] IPaddr2 Netmask Bug Fix Issue
25.03.2011 18:47, darren.mans...@opengi.co.uk:
> We configure a virtual IP on the non-arping lo interface of both servers and then
> configure the IPaddr2 resource with lvs_support=true. This RA will remove the
> duplicate IP from the lo interface when it becomes active. Grouping the VIP with
> ldirectord/LVS, we can have the load-balancer and VIP on one node, balancing
> traffic to the other node, with failover where both resources fail over together.
>
> To do this we need to configure the VIP on lo with a 32-bit netmask, but the VIP
> on the eth0 interface needs to have a 24-bit netmask. This has worked fine up
> until now and we base all of our clusters on this method. Now what happens is
> that the find_interface() routine in IPaddr2 doesn't remove the IP from lo when
> starting the VIP resource, as it can't find it due to the netmask not matching.

Do you really need the address to be deleted from lo? Having two identical addresses on a Linux machine should do no harm, as long as routing is not affected. In your case, with a /32 netmask on lo, I do not foresee any problems. We use it in this way, i.e. with the address set on lo permanently.

--
Pavel Levshin
Re: [Pacemaker] Is there any way to reduce the time for migration of the resource from one node to another node in a cluster on failover.
Andrew Beekhof writes:

Hi Andrew Beekhof,

I measured the time from when heartbeat recognized that the process had failed on the first node until the process and VIP were created on the second node. I calculated the time using the heartbeat log files, which log the messages with timestamps.
[Pacemaker] DRBD and pacemaker interaction
Hi,

I've already sent this mail to linux-ha but that list seems to be dead:

we experiment with DRBD and pacemaker and see several times that the DRBD part is degraded (one node is outdated or diskless or something similar) but crm_mon just reports that the DRBD resource runs as master and slave on the nodes.

There is no indication that the resource is not in its optimal mode of operation.

For me it seems as if pacemaker knows only the states: running, stopped, failed.

I am missing the state: running degraded or suboptimal.

Is it already there and have I made a configuration error? Or what is the recommended way to check the sanity of the resources controlled by pacemaker?

Christoph
[Pacemaker] IPaddr2 Netmask Bug Fix Issue
Hello all.

Between SLE 11 HAE and SLE 11 SP1 HAE (pacemaker 1.0.3 -> pacemaker 1.1.2) the following bit has changed in the IPaddr2 RA:

Old:

    local iface=`$IP2UTIL -o -f inet addr show | grep "\ $BASEIP/" \
        | cut -d ' ' -f2 | grep -v '^ipsec[0-9][0-9]*$'`

New:

    local iface=`$IP2UTIL -o -f inet addr show | grep "\ $BASEIP/$NETMASK" \
        | cut -d ' ' -f2 | grep -v '^ipsec[0-9][0-9]*$'`

I notice the addition of the $NETMASK variable. I'm not sure why it's been added, but it has broken how we do load balancing.

We configure a virtual IP on the non-arping lo interface of both servers and then configure the IPaddr2 resource with lvs_support=true. This RA will remove the duplicate IP from the lo interface when it becomes active. Grouping the VIP with ldirectord/LVS, we can have the load-balancer and VIP on one node, balancing traffic to the other node, with failover where both resources fail over together.

To do this we need to configure the VIP on lo with a 32-bit netmask, but the VIP on the eth0 interface needs to have a 24-bit netmask. This has worked fine up until now and we base all of our clusters on this method. Now what happens is that the find_interface() routine in IPaddr2 doesn't remove the IP from lo when starting the VIP resource, as it can't find it due to the netmask not matching.

Obviously I can edit the RA myself, but I wanted to check the reason for this. Apologies if it's in changelogs somewhere (please direct me to these if so).

Thanks,
Darren Mansell
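The failure mode can be reproduced outside the RA. The sketch below (hypothetical addresses; the here-string mimics `ip -o -f inet addr show` output) shows why the old pattern finds the /32 VIP on lo while the new pattern, anchored to the resource's 24-bit netmask, does not:

```shell
# Simulated `ip -o -f inet addr show` output: the VIP sits on lo with a
# /32 prefix, while the IPaddr2 resource is configured with netmask 24.
output='1: lo    inet 10.0.0.100/32 scope host lo
2: eth0    inet 10.0.0.10/24 scope global eth0'

BASEIP=10.0.0.100

# Old pattern: matches the VIP on lo regardless of its prefix length.
echo "$output" | grep " $BASEIP/" | cut -d ' ' -f2

# New pattern with the resource's netmask appended: the /32 entry on lo
# no longer matches, so find_interface() never sees it and cannot remove it.
NETMASK=24
echo "$output" | grep " $BASEIP/$NETMASK" || echo "no match for $BASEIP/$NETMASK"
```

The first grep prints "lo"; the second finds nothing, which is exactly the lvs_support breakage described above.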
Re: [Pacemaker] [RFC PATCH] Try to fix startup-fencing not happening
On 03/25/2011 11:10 AM, Andrew Beekhof wrote:
> On Thu, Mar 17, 2011 at 11:54 PM, Simone Gotti wrote:
>> Hi,
>>
>> When using corosync + pcmk v1, starting both corosync and pacemakerd (and
>> I think also using heartbeat or anything other than cman) as quorum
>> provider, at startup there will not be a node_state entry in the CIB for
>> the nodes that are not in the cluster.
> No, I'm pretty sure heartbeat has the same behavior.

I didn't test it, but if it works like cman then I think that startup-fencing won't work on it either. That would be very strange, though.

>> Instead, when using cman as quorum provider there will be a node_state entry
>> for every node known by cman, as lib/common/ais.c:cman_event_callback
>> calls crm_update_peer for every node reported by cman_get_nodes.
> Yep
>
>> Something similar will happen when using corosync+pcmkv1 if corosync is
>> started on N nodes but pacemakerd is started only on N-M nodes.
> Probably true.
>
>> All of this will break 'startup-fencing' because, from my understanding,
>> the logic is this:
>>
>> 1) At startup all the nodes are marked (in
>> lib/pengine/unpack.c:unpack_node) as unclean.
>> 2) lib/pengine/unpack.c:unpack_status will cycle only over the node_state
>> entries available in the cib status section, resetting them to a clean status
>> at the start and then marking them as unclean if some conditions are met.
>> 3) In pengine/allocate.c:stage6 all the unclean nodes are fenced.
>>
>> In the above conditions you'll have a node_state entry in the cib status
>> section even for nodes without pacemakerd enabled, and the startup
>> fencing won't happen because there isn't any condition in unpack_status
>> that will mark them as unclean.
> But they're unclean by default... so the lack of a node_state
> shouldn't affect that.
> Or did you mean "clean" instead of "unclean"?

The problem is not the lack of a node_state but the opposite: the presence of a node_state even for nodes that haven't joined the cluster. This happens with the current cman integration.
The nodes known to pacemaker are all set as unclean by default (point 1 above). But if their node_state is available in the CIB, then in point 2 they will be set as clean (unclean=false) and no condition check in unpack_status will mark them as unclean=true again.

>> I'm not very expert in the code. I discarded the solution of not
>> registering at startup all the nodes known by cman but only the active ones,
>> as it won't fix the corosync+pcmkv1 case.
>>
>> Instead I tried to understand when a node that has its status in the cib
>> should be startup-fenced, and a possible solution is in the attached patch.
>> I noticed that when crm_update_peer inserts a new node, this one doesn't
>> have the expected attribute set. So if startup-fencing is enabled I'm
>> going to set the node as expected up.
>
> You lost me there... isn't this covered by just setting startup-fencing=false?

I lost you too :D. The problem is that startup-fencing is not working. Anyway, this first patch is a rough attempt to make startup-fencing work when there are node_state tags in the CIB even for nodes not in the cluster. It was a quick attempt that I don't like, as my intention was primarily to explain the actual problem. But probably I wasn't very clear in doing this. Sorry.

In the mail I sent after this one, I tried to make a first step by changing the behavior of the cman integration to make it work like the other implementations: add a node_state tag only for the hosts that have joined the cluster.

Thanks! Bye!
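The startup-fencing logic described above can be reduced to a toy shell model (hypothetical node names, not actual pengine code). Here node3 never joined and has no node_state entry, so it stays unclean; with the current cman integration, every node would appear in nodes_with_state and nothing would remain unclean:

```shell
# Toy model (not pengine source) of the three-step unclean-node logic.
known_nodes="node1 node2 node3"
nodes_with_state="node1 node2"   # node3 never joined: no node_state entry

# Step 1: every known node starts out unclean.
for n in $known_nodes; do eval "unclean_$n=true"; done

# Step 2 (simplified): only nodes with a node_state entry in the status
# section are revisited and reset to clean.
for n in $nodes_with_state; do eval "unclean_$n=false"; done

# Step 3: whatever is still unclean gets scheduled for startup fencing.
for n in $known_nodes; do
  eval "u=\$unclean_$n"
  if [ "$u" = true ]; then echo "$n: unclean, schedule startup fencing"; fi
done
```

With cman pre-creating a node_state for all three nodes, step 2 would clear all of them and step 3 would fence nothing, which is the bug being discussed.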
Re: [Pacemaker] Is there any way to reduce the time for migration of the resource from one node to another node in a cluster on failover.
On Tue, Mar 22, 2011 at 12:41 PM, rakesh k wrote:
> Hi All
>
> I am providing you the configuration I used for testing the resource
> migration.
>
> Node-1 resource failed.
> Message sent to node-2.
> The log message I found in the ha-debug file: (pengine: [15991]: notice:
> common_apply_stickiness: Tomcat1 can fail 99 more times on mysql3 before
> being forced off)
> The process is restarted on the same node where it failed, after the timeout.
> Stopping virtual IP.
> Send note to second node (node-2).
> Starting VIP.
> Second node starts the process.
>
> The total duration it is taking is about one and a half minutes.

Measured from when to when?

> Is there any way to reduce the time for this scenario?
>
> Please find the configuration I used:
>
> node $id="6317f856-e57b-4a03-acf1-ca81af4f19ce" cisco-demomsf
> node $id="87b8b88e-3ded-4e34-8708-46f7afe62935" mysql3
> primitive Tomcat1 ocf:heartbeat:tomcat \
>     params tomcat_name="tomcat" statusurl="http://localhost:8080/dbtest/testtomcat.html" java_home="/" catalina_home="/home/msf/runtime/tomcat/apache-tomcat-6.0.18" client="curl" testregex="*" \
>     op start interval="0" timeout="60s" \
>     op monitor interval="50s" timeout="50s" \
>     op stop interval="0" \
>     meta target-role="Started"
> primitive Tomcat1VIP ocf:heartbeat:IPaddr3 \
>     params ip="" eth_num="eth0:2" vip_cleanup_file="/var/run/bigha.pid" \
>     op start interval="0" timeout="120s" \
>     op monitor interval="30s" \
>     meta target-role="Started"
> colocation Tomcat1-with-ip inf: Tomcat1VIP Tomcat1
> order Tomcat1-after-ip inf: Tomcat1VIP Tomcat1
> property $id="cib-bootstrap-options" \
>     dc-version="1.0.9-89bd754939df5150de7cd76835f98fe90851b677" \
>     cluster-infrastructure="Heartbeat" \
>     stonith-enabled="false" \
>     no-quorum-policy="ignore" \
>     last-lrm-refresh="1300787402"
> rsc_defaults $id="rsc-options" \
>     resource-stickiness="500"
>
> Regards,
> Rakesh
Re: [Pacemaker] [RFC PATCH] Try to fix startup-fencing not happening
On Thu, Mar 17, 2011 at 11:54 PM, Simone Gotti wrote:
> Hi,
>
> When using corosync + pcmk v1, starting both corosync and pacemakerd (and
> I think also using heartbeat or anything other than cman) as quorum
> provider, at startup there will not be a node_state entry in the CIB for
> the nodes that are not in the cluster.

No, I'm pretty sure heartbeat has the same behavior.

> Instead, when using cman as quorum provider there will be a node_state entry
> for every node known by cman, as lib/common/ais.c:cman_event_callback
> calls crm_update_peer for every node reported by cman_get_nodes.

Yep

> Something similar will happen when using corosync+pcmkv1 if corosync is
> started on N nodes but pacemakerd is started only on N-M nodes.

Probably true.

> All of this will break 'startup-fencing' because, from my understanding,
> the logic is this:
>
> 1) At startup all the nodes are marked (in
> lib/pengine/unpack.c:unpack_node) as unclean.
> 2) lib/pengine/unpack.c:unpack_status will cycle only over the node_state
> entries available in the cib status section, resetting them to a clean status
> at the start and then marking them as unclean if some conditions are met.
> 3) In pengine/allocate.c:stage6 all the unclean nodes are fenced.
>
> In the above conditions you'll have a node_state entry in the cib status
> section even for nodes without pacemakerd enabled, and the startup
> fencing won't happen because there isn't any condition in unpack_status
> that will mark them as unclean.

But they're unclean by default... so the lack of a node_state shouldn't affect that. Or did you mean "clean" instead of "unclean"?

> I'm not very expert in the code. I discarded the solution of not
> registering at startup all the nodes known by cman but only the active ones,
> as it won't fix the corosync+pcmkv1 case.
>
> Instead I tried to understand when a node that has its status in the cib
> should be startup-fenced, and a possible solution is in the attached patch.
> I noticed that when crm_update_peer inserts a new node, this one doesn't
> have the expected attribute set. So if startup-fencing is enabled I'm
> going to set the node as expected up.

You lost me there... isn't this covered by just setting startup-fencing=false?
Re: [Pacemaker] Fencing order
On Mon, Mar 21, 2011 at 4:06 PM, Pavel Levshin wrote:
> Hi.
>
> Today we had a network outage. Quite a few problems suddenly arose in our
> setup, including crashed corosync, the known notify bug in the DRBD RA, and some
> problem with a VirtualDomain RA timeout on stop.
>
> But particularly strange was the fencing behaviour.
>
> Initially, one node (wapgw1-1) parted from the cluster. When the connection
> was restored, corosync died on that node. It was considered "offline
> unclean" and was scheduled to be fenced. Fencing by HP iLO did not work
> (currently, I do not know why). The second-priority fencing method is meatware,
> and it did take time.
>
> The second node, wapgw1-2, hit the DRBD notify bug and failed to stop some
> resources. It was "online unclean". It was also scheduled to be fenced. HP
> iLO was available for this node, but it was not STONITHed until I
> manually confirmed STONITH for wapgw1-1.
>
> When I confirmed the first node's restart, the second node was fenced automatically.
>
> Is this ordering intended behaviour or a bug?

A little of both. The ordering (in the PE) was added because stonithd wasn't able to cope with parallel fencing operations. I don't know if this is still the case for stonithd in 1.0. Perhaps Dejan can comment.

Unfortunately, as you saw, this means that we fence nodes one by one - and that if op N fails, we never try op > N. Ideally the ordering would be removed; let's see what Dejan has to say.

> It's pacemaker 1.0.10, corosync 1.2.7. Three-node cluster.
>
> --
> Pavel Levshin
Re: [Pacemaker] Pacemaker with Apache2...
Guessing the status URL isn't enabled in the apache config.

On Wed, Mar 23, 2011 at 8:53 PM, Pavel Levshin wrote:
> 23.03.2011 17:10, Yannik Nicod:
>
>> Failed actions:
>> WebSite_start_0 (node=clutest02, call=4, rc=1, status=complete): unknown error
>> WebSite_monitor_0 (node=clutest01, call=3, rc=1, status=complete): unknown error
>> WebSite_start_0 (node=clutest01, call=7, rc=1, status=complete): unknown error
>>
>> Can anybody tell me what I should do? A good hint?
>
> Logs are very helpful. You could search for the operation names, i.e.
> 'WebSite_start_0', and see what happened.
>
> FYI, the apache RA depends on the 'server-status' feature of apache.
>
> --
> Pavel Levshin
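If mod_status is indeed the missing piece, enabling the server-status handler looks roughly like this (Apache 2.2-era syntax; the exact file locations, and whether LoadModule is needed at all, vary by distribution):

```apache
# Enable mod_status and the /server-status handler that the apache RA polls.
# Access is restricted to localhost, which is where the RA queries from.
LoadModule status_module modules/mod_status.so

<Location /server-status>
    SetHandler server-status
    Order deny,allow
    Deny from all
    Allow from 127.0.0.1
</Location>
```

After reloading apache, `curl http://localhost/server-status` should return a status page rather than a 404, and the RA's start/monitor operations should stop failing for this reason.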
Re: [Pacemaker] How to send email-notification on failure of resource in cluster frame work
"man crm_mon" look for the word "mail", if its not there - then whoever built the packages didnt include support for that feature On Thu, Mar 24, 2011 at 5:46 AM, Rakesh K wrote: > Hi ALL > Is there any way to send Email notifications when a resource is failure in the > cluster frame work. > > while i was going through the Pacemaker-explained document provided in the > website www.clusterlabs.org > > There was no content in the chapter 7 --> which is sending email notification > events. > > can anybody help me regarding this. > > for know i am approaching the crm_mon --daemonize --as-html to > maintain the status of HA in html file. > > Is there any other approach for sending email notification. > > Regards > Rakesh > > > ___ > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > Project Home: http://www.clusterlabs.org > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > Bugs: > http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker > ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] CMAN integration questions
On Thu, Mar 24, 2011 at 9:27 AM, Vladislav Bogdanov wrote:
> 23.03.2011 21:38, Pavel Levshin wrote:
>> 23.03.2011 15:56, Vladislav Bogdanov:
>>
>>> After 1 minute vd01-d takes over the DC role.
>>>
>>> Mar 23 10:10:03 vd01-d crmd: [1875]: info: update_dc: Set DC to vd01-d (3.0.5)
>>
>> Excuse me, I have not much knowledge of cman integration, but don't you
>> think that DC election should be much faster?
>
> I do. But this could depend on many factors: the number of nodes in the
> cluster (16 in my case), the totem transport used on the underlying layer
> (UDPU), etc. Probably Andrew can clarify this; I cannot.

Not really, I know very little of how cman works or is configured. Potentially it's related to the messaging timeouts used by corosync when it's reading the configuration from cluster.conf. Pacemaker can only react to the information it's been given - and the timeouts may affect how long it takes for that information to reach pacemaker. But it's impossible to say much about what the cluster is doing based on the provided log fragments.

>> Pacemaker can hardly work without a DC. And STONITH of an unexpectedly down
>> node should be much faster, too. It's not clear from your log excerpts
>> why fencing the node takes so long.
>
> I understand. The main point was that fenced does the same much faster
> if it has fence devices configured.

Yes, but then you create an internal split-brain condition.

> I checked this, and fencing by fenced takes only 20 seconds on my setup.
> Then DLM unlocks and the cluster continues to work. The only drawback is
> that the node is killed twice, once by fenced and once by pacemaker. But
> this is a very minor issue compared to a 3-4 minute DLM lock.
>
> Maybe it would be possible to make pacemaker fencing faster in this
> particular case, but this may require some effort, if it is possible at all.
>
> Best,
> Vladislav