[Pacemaker] Return value from promote function
I am running Pacemaker 1.0.9.1 and Heartbeat 3.0.3. I have a master/slave resource with an agent. When the resource hangs while doing a promote, the agent returns OCF_ERR_GENERIC. However, all this does is cause the CRM to demote the resource, restart it on the same node, and then retry the promote on the same node. Is there any way I can have the CRM promote the resource on the peer node instead?

My configuration is:

node $id="856c1f72-7cd1-4906-8183-8be87eef96f2" mgraid-mkp9010repk-1
node $id="f4e5e15c-d06b-4e37-89b9-4621af05128f" mgraid-mkp9010repk-0
primitive SSMKP9010REPK ocf:omneon:ss \
        params ss_resource="SSMKP9010REPK" ssconf="/var/omneon/config/config.MKP9010REPK" \
        op monitor interval="3s" role="Master" timeout="7s" \
        op monitor interval="10s" role="Slave" timeout="7" \
        op stop interval="0" timeout="120" \
        op start interval="0" timeout="600"
primitive icms lsb:S53icms \
        op monitor interval="5s" timeout="7" \
        op start interval="0" timeout="5"
primitive mgraid-stonith stonith:external/mgpstonith \
        params hostlist="mgraid-canister" \
        op monitor interval="0" timeout="20s"
primitive omserver lsb:S49omserver \
        op monitor interval="5s" timeout="7" \
        op start interval="0" timeout="5"
ms ms-SSMKP9010REPK SSMKP9010REPK \
        meta clone-max="2" notify="true" globally-unique="false" target-role="Master"
clone Fencing mgraid-stonith
clone cloneIcms icms
clone cloneOmserver omserver
location ms-SSMKP9010REPK-master-w1 ms-SSMKP9010REPK \
        rule $id="ms-SSMKP9010REPK-master-w1-rule" $role="master" 100: #uname eq mgraid-mkp9010repk-0
order orderms-SSMKP9010REPK 0: cloneIcms ms-SSMKP9010REPK
property $id="cib-bootstrap-options" \
        dc-version="1.0.9-89bd754939df5150de7cd76835f98fe90851b677" \
        cluster-infrastructure="Heartbeat" \
        dc-deadtime="5s" \
        stonith-enabled="true"

Thanks,
Bob

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
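[Editor's note: not part of the original thread. One mechanism that may give the behaviour Bob asks for is a migration-threshold on the resource, so that a failed promote makes the node ineligible and the role moves to the peer. This is a sketch only, not verified on Pacemaker 1.0.9; the threshold and timeout values are illustrative choices.]

```shell
# Sketch: let a single failure push the resource away from the failing node.
# After migration-threshold failures the node is excluded until the
# failcount expires (failure-timeout) or is cleared manually.
crm configure rsc_defaults migration-threshold="1" failure-timeout="120s"

# Inspect the failcount, and clear it once the underlying problem is fixed:
crm_failcount -G -r SSMKP9010REPK
crm resource cleanup SSMKP9010REPK
```

Whether a failed promote (as opposed to a failed monitor or start) actually moves the master role on this Pacemaker version would need testing; the thread does not confirm it.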
Re: [Pacemaker] The effects of /var being full on failure detection
Hi Ryan,

On 7 February 2011 17:24, Ryan Thomson wrote:
> We have /var mounted separately, but not /var/log. Interesting idea. Part of
> our /var problem was twofold: we had enabled debug logging and iptables
> logging to diagnose a previous problem and neglected to turn them off again
> after the diagnosis session, which caused unusually high log volume, plus we
> never enabled logrotate for the firewall log, so it just grew and grew
> without being rotated out. Tough way to be reminded of improper
> configuration...

Sometimes learning things "the hard way" is necessary :)

A small word of caution about a nested /var/log mount point: I haven't used
this in a long time. More modern distros that start everything at once might
have service dependency issues, so YMMV. I.e. test first.

> --Ryan

--
Best Regards,
Brett Delle Grazie
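[Editor's note: a sketch of the separate-/var/log setup Brett describes, assuming an LVM volume group. The volume group name `vg0`, the size, and ext3 are placeholders; adjust for the actual system, and migrate existing log contents before switching over.]

```shell
# Sketch: give /var/log its own logical volume so runaway logs cannot
# fill /var itself (where Heartbeat keeps CIB state under
# /var/lib/heartbeat/crm/).
lvcreate -L 4G -n varlog vg0          # vg0 is a placeholder VG name
mkfs.ext3 /dev/vg0/varlog
echo '/dev/vg0/varlog  /var/log  ext3  defaults  0 2' >> /etc/fstab
mount /var/log
```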
Re: [Pacemaker] The effects of /var being full on failure detection
Hi Brett,

> > My question is this: Would /var being full on the passive node have played
> > a role in the cluster's inability to failover during the soft lockup
> > condition on the active node? Or perhaps we hit a condition in which our
> > configuration of Pacemaker was unable to detect this type of failure? I'm
> > basically trying to figure out if /var being full on the passive node
> > played a role in the lack of failover, or if our configuration is
> > inadequate at detecting the type of failure we experienced.
>
> I'd say absolutely yes. /var being full probably stopped cluster traffic or,
> at the least, changes to the cib from being accepted (from memory, cib
> changes are written to temp files in /var/lib/heartbeat/crm/...).

Thanks for the feedback. This is what I suspected, but I wasn't sure if my
suspicions were correct. Too bad I don't have a test/dev Pacemaker
environment to test this situation with, otherwise I could be 100% sure
instead of 99% sure.

> It can certainly stop ssh sessions from being established.

That it did!

> > Thoughts?
>
> Just for the list (since I'm sure you've done this or similar already), I'd
> suggest you use SNMP monitoring and add an SNMP trap for /var being 95%
> full.

Yep, it's something we're on top of.

> A useful addition is to mount /var/log on a different disk/partition/logical
> volume from /var; that way, even if your logs fill up, the system should
> still continue to function for a while.

We have /var mounted separately, but not /var/log. Interesting idea. Part of
our /var problem was twofold: we had enabled debug logging and iptables
logging to diagnose a previous problem and neglected to turn them off again
after the diagnosis session, which caused unusually high log volume, plus we
never enabled logrotate for the firewall log, so it just grew and grew
without being rotated out. Tough way to be reminded of improper
configuration...

--Ryan
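[Editor's note: alongside the SNMP trap discussed above, a local cron-driven check is a cheap safety net. A minimal sketch; the 95% threshold mirrors the one mentioned in the thread, and the `var-check` syslog tag is my own choice.]

```shell
#!/bin/sh
# Warn via syslog when /var usage reaches 95%.
# df -P gives POSIX single-line output; column 5 is "Use%" (e.g. "42%").
usage=$(df -P /var | awk 'NR==2 { sub(/%/, "", $5); print $5 }')
if [ "$usage" -ge 95 ]; then
    echo "WARNING: /var is ${usage}% full" | logger -t var-check -p user.warning
fi
```

Dropping this into /etc/cron.hourly (and pairing it with a logrotate stanza for the firewall log) covers both failure modes Ryan describes.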
Re: [Pacemaker] Corosync & IPAddr problems(?)
Hi,

On Mon, Feb 07, 2011 at 02:01:11PM +0100, Stephan-Frank Henry wrote:
> Hello again,
>
> I am having some possible problems with Corosync and IPaddr.
> To be more specific, when I do a /etc/init.d/corosync stop, while everything
> shuts down more or less gracefully, the virtual IP is never released (still
> visible with ifconfig).
>
> If I do a 'sudo ifdown --force eth0:0' it works. So there should be no
> direct reason for this.
>
> This might not by itself be a problem, but I fear it could also be related
> to 'split-brain' corosync handling due to a network cable disconnect.
> Though that might be something else, I'd rather remove all other problems
> and then see if it fixes itself.
>
> I have checked syslog, but nothing really jumps out.
> Are there any other logs or places where I can look?
>
> thanks everyone!
>
> Frank
>
> (pls scream if more or other info is needed)
>
> OS: Debian Lenny 64bit, kernel version: 2.6.33.3
> Corosync: 1.2.1-1~bpo50+1
> cluster-glue: 1.0.6-1~bpo50+1
> libheartbeat2: 1:3.0.3-2~bpo50+1
>
> relevant cib.xml entry:
>
> here is a reduced log (only the ip stuff):
> Feb 7 13:39:40 serverA pengine: [8695]: notice: unpack_rsc_op: Operation ip_resource_monitor_0 found resource ip_resource active on serverA
> Feb 7 13:39:40 serverA pengine: [8695]: notice: native_print: ip_resource#011(ocf::heartbeat:IPaddr):#011Started serverA
> Feb 7 13:39:40 serverA pengine: [8695]: info: native_merge_weights: ms_drbd0: Rolling back scores from ip_resource
> Feb 7 13:39:40 serverA pengine: [8695]: info: native_merge_weights: ms_drbd0: Rolling back scores from ip_resource
> Feb 7 13:39:40 serverA pengine: [8695]: info: native_merge_weights: ip_resource: Rolling back scores from fs0
> Feb 7 13:39:40 serverA pengine: [8695]: info: native_color: Resource ip_resource cannot run anywhere
> Feb 7 13:39:40 serverA pengine: [8695]: notice: LogActions: Stop resource ip_resource#011(serverA)
> Feb 7 13:39:40 serverA crmd: [8696]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> Feb 7 13:39:42 serverA crmd: [8696]: info: te_rsc_command: Initiating action 33: stop ip_resource_stop_0 on serverA (local)
> Feb 7 13:39:42 serverA lrmd: [8693]: info: cancel_op: operation monitor[7] on ocf::IPaddr::ip_resource for client 8696, its parameters: CRM_meta_interval=[1] ip=[150.158.183.30]
> Feb 7 13:39:42 serverA crmd: [8696]: info: do_lrm_rsc_op: Performing key=33:13:0:0dff3321-22f5-411c-a50a-e95fcfa4dd6f op=ip_resource_stop_0 )
> Feb 7 13:39:42 serverA lrmd: [8693]: info: rsc:ip_resource:14: stop
> Feb 7 13:39:42 serverA crmd: [8696]: info: process_lrm_event: LRM operation ip_resource_monitor_1 (call=7, status=1, cib-update=0, confirmed=true) Cancelled
> Feb 7 13:40:02 serverA lrmd: [8693]: WARN: ip_resource:stop process (PID 10541) timed out (try 1). Killing with signal SIGTERM (15).

The stop action times out. You should check why. Note that IPaddr does not
use ifdown; it uses ifconfig down. You can also test the resource using
ocf-tester outside of the cluster.

Thanks,

Dejan

> Feb 7 13:40:02 serverA lrmd: [8693]: WARN: operation stop[14] on ocf::IPaddr::ip_resource for client 8696, its parameters: ip=[150.158.183.30] cidr_netmask=[22] CRM_meta_timeout=[2]
> Feb 7 13:40:02 serverA lrmd: [8693]: info: record_op_completion: cannot record operation stop[14] on ocf::IPaddr::ip_resource for client 8696: the client is gone
> Feb 7 13:40:02 serverA lrmd: [8693]: WARN: notify_client: client for the operation operation stop[14] on ocf::IPaddr::ip_resource for client 8696, its parameters: ip=[150.158.183.30]
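[Editor's note: a sketch of the ocf-tester invocation Dejan suggests, using the parameters visible in the log above. The agent path is the usual resource-agents location and may differ on Debian Lenny; check with `dpkg -L cluster-glue | grep IPaddr`.]

```shell
# Exercise the IPaddr agent outside the cluster, with the same parameters
# the lrmd passes. ocf-tester runs the full start/monitor/stop cycle and
# reports which action misbehaves.
ocf-tester -n ip_resource \
    -o ip=150.158.183.30 \
    -o cidr_netmask=22 \
    /usr/lib/ocf/resource.d/heartbeat/IPaddr
```

If the stop action also hangs here, the problem is in the agent or the interface configuration rather than in Corosync shutdown ordering.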
Re: [Pacemaker] Corosync & IPAddr problems(?)
Try using IPaddr2.

-Shravan

On Mon, Feb 7, 2011 at 8:01 AM, Stephan-Frank Henry wrote:
> Hello again,
>
> I am having some possible problems with Corosync and IPaddr.
> To be more specific, when I do a /etc/init.d/corosync stop, while
> everything shuts down more or less gracefully, the virtual IP is never
> released (still visible with ifconfig).
>
> [...]
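[Editor's note: a sketch of what the switch to IPaddr2 might look like, using the resource name and values from the log in this thread. IPaddr2 manages the address via iproute2 (`ip addr`) rather than ifconfig aliases; verify the netmask and add a `nic` parameter if the interface should be pinned.]

```shell
# Replace the IPaddr primitive with its IPaddr2 equivalent
# (stop/delete the old resource first; timeouts are illustrative).
crm configure primitive ip_resource ocf:heartbeat:IPaddr2 \
    params ip="150.158.183.30" cidr_netmask="22" \
    op monitor interval="10s" timeout="20s" \
    op stop interval="0" timeout="20s"
```

Note that addresses added with `ip addr` do not show up as `eth0:0`-style aliases in plain `ifconfig`; use `ip addr show` to verify.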
[Pacemaker] Corosync & IPAddr problems(?)
Hello again,

I am having some possible problems with Corosync and IPaddr. To be more
specific, when I do a /etc/init.d/corosync stop, while everything shuts down
more or less gracefully, the virtual IP is never released (still visible
with ifconfig).

If I do a 'sudo ifdown --force eth0:0' it works. So there should be no
direct reason for this.

This might not by itself be a problem, but I fear it could also be related
to 'split-brain' corosync handling due to a network cable disconnect. Though
that might be something else, I'd rather remove all other problems and then
see if it fixes itself.

I have checked syslog, but nothing really jumps out. Are there any other
logs or places where I can look?

thanks everyone!

Frank

(pls scream if more or other info is needed)

-

OS: Debian Lenny 64bit, kernel version: 2.6.33.3
Corosync: 1.2.1-1~bpo50+1
cluster-glue: 1.0.6-1~bpo50+1
libheartbeat2: 1:3.0.3-2~bpo50+1

relevant cib.xml entry:

here is a reduced log (only the ip stuff):
Feb 7 13:39:40 serverA pengine: [8695]: notice: unpack_rsc_op: Operation ip_resource_monitor_0 found resource ip_resource active on serverA
Feb 7 13:39:40 serverA pengine: [8695]: notice: native_print: ip_resource#011(ocf::heartbeat:IPaddr):#011Started serverA
Feb 7 13:39:40 serverA pengine: [8695]: info: native_merge_weights: ms_drbd0: Rolling back scores from ip_resource
Feb 7 13:39:40 serverA pengine: [8695]: info: native_merge_weights: ms_drbd0: Rolling back scores from ip_resource
Feb 7 13:39:40 serverA pengine: [8695]: info: native_merge_weights: ip_resource: Rolling back scores from fs0
Feb 7 13:39:40 serverA pengine: [8695]: info: native_color: Resource ip_resource cannot run anywhere
Feb 7 13:39:40 serverA pengine: [8695]: notice: LogActions: Stop resource ip_resource#011(serverA)
Feb 7 13:39:40 serverA crmd: [8696]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Feb 7 13:39:42 serverA crmd: [8696]: info: te_rsc_command: Initiating action 33: stop ip_resource_stop_0 on serverA (local)
Feb 7 13:39:42 serverA lrmd: [8693]: info: cancel_op: operation monitor[7] on ocf::IPaddr::ip_resource for client 8696, its parameters: CRM_meta_interval=[1] ip=[150.158.183.30]
Feb 7 13:39:42 serverA crmd: [8696]: info: do_lrm_rsc_op: Performing key=33:13:0:0dff3321-22f5-411c-a50a-e95fcfa4dd6f op=ip_resource_stop_0 )
Feb 7 13:39:42 serverA lrmd: [8693]: info: rsc:ip_resource:14: stop
Feb 7 13:39:42 serverA crmd: [8696]: info: process_lrm_event: LRM operation ip_resource_monitor_1 (call=7, status=1, cib-update=0, confirmed=true) Cancelled
Feb 7 13:40:02 serverA lrmd: [8693]: WARN: ip_resource:stop process (PID 10541) timed out (try 1). Killing with signal SIGTERM (15).
Feb 7 13:40:02 serverA lrmd: [8693]: WARN: operation stop[14] on ocf::IPaddr::ip_resource for client 8696, its parameters: ip=[150.158.183.30] cidr_netmask=[22] CRM_meta_timeout=[2]
Feb 7 13:40:02 serverA lrmd: [8693]: info: record_op_completion: cannot record operation stop[14] on ocf::IPaddr::ip_resource for client 8696: the client is gone
Feb 7 13:40:02 serverA lrmd: [8693]: WARN: notify_client: client for the operation operation stop[14] on ocf::IPaddr::ip_resource for client 8696, its parameters: ip=[150.158.183.30]