[Pacemaker] Return value from promote function

2011-02-07 Thread Bob Schatz
I am running Pacemaker 1.0.9.1 and Heartbeat 3.0.3.

I have a master/slave resource managed by a custom OCF agent.

When the resource hangs during a promote, the agent returns OCF_ERR_GENERIC.

However, all this does is cause the CRM to demote the resource, restart it on
the same node, and then retry the promote on that same node.

Is there any way I can have the CRM promote the resource on the peer node
instead?
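
(For reference, a minimal sketch of a promote function that reports failure this
way; this is a hypothetical agent skeleton, not the actual ocf:omneon:ss code:)

# fragment of a hypothetical OCF resource agent, using the standard shell helpers
. ${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs

ss_promote() {
    # ask the managed service to become master; the command name is a placeholder
    if ! /usr/bin/ss-ctl promote; then
        ocf_log err "promote failed"
        return $OCF_ERR_GENERIC    # this is what the CRM sees as the promote failure
    fi
    return $OCF_SUCCESS
}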

My configuration is:

node $id="856c1f72-7cd1-4906-8183-8be87eef96f2" mgraid-mkp9010repk-1
node $id="f4e5e15c-d06b-4e37-89b9-4621af05128f" mgraid-mkp9010repk-0
primitive SSMKP9010REPK ocf:omneon:ss \
params ss_resource="SSMKP9010REPK" ssconf="/var/omneon/config/config.MKP9010REPK" \
op monitor interval="3s" role="Master" timeout="7s" \
op monitor interval="10s" role="Slave" timeout="7" \
op stop interval="0" timeout="120" \
op start interval="0" timeout="600"
primitive icms lsb:S53icms \
op monitor interval="5s" timeout="7" \
op start interval="0" timeout="5"
primitive mgraid-stonith stonith:external/mgpstonith \
params hostlist="mgraid-canister" \
op monitor interval="0" timeout="20s"
primitive omserver lsb:S49omserver \
op monitor interval="5s" timeout="7" \
op start interval="0" timeout="5"
ms ms-SSMKP9010REPK SSMKP9010REPK \
meta clone-max="2" notify="true" globally-unique="false" target-role="Master"
clone Fencing mgraid-stonith
clone cloneIcms icms
clone cloneOmserver omserver
location ms-SSMKP9010REPK-master-w1 ms-SSMKP9010REPK \
rule $id="ms-SSMKP9010REPK-master-w1-rule" $role="master" 100: #uname eq mgraid-mkp9010repk-0
order orderms-SSMKP9010REPK 0: cloneIcms ms-SSMKP9010REPK
property $id="cib-bootstrap-options" \
dc-version="1.0.9-89bd754939df5150de7cd76835f98fe90851b677" \
cluster-infrastructure="Heartbeat" \
dc-deadtime="5s" \
stonith-enabled="true"



Thanks,

Bob


  



Re: [Pacemaker] The effects of /var being full on failure detection

2011-02-07 Thread Brett Delle Grazie
Hi Ryan,

On 7 February 2011 17:24, Ryan Thomson  wrote:

> We have /var mounted separately, but not /var/log. Interesting idea. Part of
> our /var problem was twofold: we had enabled debug logging and iptables
> logging to diagnose a previous problem and neglected to turn them off again
> after the diagnosis session, which caused unusually high log volume, plus we
> never enabled logrotate for the firewall log so it just grew and grew without
> being rotated out. Tough way to be reminded of improper configuration...

Sometimes learning things "the hard way" is necessary :)

A small word of caution about the nested /var/log mount point: I haven't used
this setup in a long time, and more modern distros that start everything in
parallel might have service dependency issues, so YMMV, i.e. test first.
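
For illustration, a hypothetical /etc/fstab layout with /var and /var/log on
separate logical volumes (the volume group and LV names are made up):

# /etc/fstab fragment -- adjust devices to your own VG/LV layout
/dev/vg00/lv_var      /var      ext3  defaults  0  2
/dev/vg00/lv_varlog   /var/log  ext3  defaults  0  2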

>
> --Ryan
>

-- 
Best Regards,

Brett Delle Grazie



Re: [Pacemaker] The effects of /var being full on failure detection

2011-02-07 Thread Ryan Thomson

Hi Brett,


My question is this: would /var being full on the passive node have played a
role in the cluster's inability to fail over during the soft lockup condition on
the active node? Or did we perhaps hit a condition that our Pacemaker
configuration cannot detect? I'm basically trying to figure out whether /var
being full on the passive node played a role in the lack of failover, or whether
our configuration is inadequate at detecting the type of failure we experienced.


I'd say absolutely yes. /var being full probably stopped cluster
traffic or, at the very least, stopped changes to the CIB from being accepted
(from memory, CIB changes are written to temp files in /var/lib/heartbeat/crm/...).
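
As a quick sanity check on a Heartbeat/Pacemaker 1.0 node (paths per the
default packaging, so adjust if yours differ):

df -h /var                      # is the filesystem actually full?
ls -l /var/lib/heartbeat/crm/   # cib.xml, cib.xml.sig and their backups live here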


Thanks for the feedback. This is what I suspected but I wasn't sure if 
my suspicions were correct. Too bad I don't have a test/dev pacemaker 
environment to test this situation with, otherwise I could be 100% sure 
instead of 99% sure.



It can certainly stop ssh sessions from being established.


That it did!



Thoughts?


Just for the list (since I'm sure you've done this or similar already)
I'd suggest you use SNMP monitoring and add an SNMP trap for /var
being 95% full.


Yep, it's something we're on top of.
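
For anyone else wiring this up, a rough net-snmp snmpd.conf sketch (the
threshold, trap sink and community are placeholders, and DisMan-based traps
also need the agent's internal query user configured):

# flag an error once /var has less than 5% free space (i.e. ~95% full)
disk /var 5%
# where to send notifications
trap2sink nms.example.com public
# enable the built-in DisMan monitors (these include dskErrorFlag);
# requires iquerySecName/rouser for the agent's internal queries
defaultMonitors yes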


A useful addition is to mount /var/log on a different
disk/partition/logical volume from /var, that way even if your logs
fill up, the system should still continue to function for a while.


We have /var mounted separately, but not /var/log. Interesting idea.
Part of our /var problem was twofold: we had enabled debug logging and
iptables logging to diagnose a previous problem and neglected to turn
them off again after the diagnosis session, which caused unusually high
log volume, plus we never enabled logrotate for the firewall log so it just
grew and grew without being rotated out. Tough way to be reminded of
improper configuration...


--Ryan



Re: [Pacemaker] Corosync & IPAddr problems(?)

2011-02-07 Thread Dejan Muhamedagic
Hi,

On Mon, Feb 07, 2011 at 02:01:11PM +0100, Stephan-Frank Henry wrote:
> Hello again,
> 
> I am having some possible problems with Corosync and IPAddr.
> To be more specific, when I do a /etc/init.d/corosync stop, while everything 
> shuts down more or less gracefully, the virtual ip never is released (still 
> visible with ifconfig).
> 
> if I do a 'sudo ifdown --force eth0:0' it works. So there should be no direct 
> reason for this.
> 
> This might not by itself be a problem, but I fear it could also be related to 
> a 'split-brain' corosync handling due to network cable disconnect.
> Though that might be something else, I'd rather remove all other problems and 
> then see if it fixes itself.
> 
> I have checked syslog, but nothing really jumps out.
> Are there any other logs or places where I can look?
> 
> thanks everyone!
> 
> Frank
> 
> (pls scream if more or other info is needed)
> 
> -
> 
> OS: Debian Lenny 64bit, kernel version: 2.6.33.3
> Corosync: 1.2.1-1~bpo50+1
> cluster-glue: 1.0.6-1~bpo50+1
> libheartbeat2: 1:3.0.3-2~bpo50+1
> 
> relevant cib.xml entry:
> [IPaddr primitive definition stripped by the list archive]
> 
> here is a reduced log (only the ip stuff):
> Feb  7 13:39:40 serverA pengine: [8695]: notice: unpack_rsc_op: Operation 
> ip_resource_monitor_0 found resource ip_resource active on serverA
> Feb  7 13:39:40 serverA pengine: [8695]: notice: native_print:  
> ip_resource#011(ocf::heartbeat:IPaddr):#011Started serverA
> Feb  7 13:39:40 serverA pengine: [8695]: info: native_merge_weights: 
> ms_drbd0: Rolling back scores from ip_resource
> Feb  7 13:39:40 serverA pengine: [8695]: info: native_merge_weights: 
> ms_drbd0: Rolling back scores from ip_resource
> Feb  7 13:39:40 serverA pengine: [8695]: info: native_merge_weights: 
> ip_resource: Rolling back scores from fs0
> Feb  7 13:39:40 serverA pengine: [8695]: info: native_color: Resource 
> ip_resource cannot run anywhere
> Feb  7 13:39:40 serverA pengine: [8695]: notice: LogActions: Stop resource 
> ip_resource#011(serverA)
> Feb  7 13:39:40 serverA crmd: [8696]: info: do_state_transition: State 
> transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS 
> cause=C_IPC_MESSAGE origin=handle_response ]
> Feb  7 13:39:42 serverA crmd: [8696]: info: te_rsc_command: Initiating action 
> 33: stop ip_resource_stop_0 on serverA (local)
> Feb  7 13:39:42 serverA lrmd: [8693]: info: cancel_op: operation monitor[7] 
> on ocf::IPaddr::ip_resource for client 8696, its parameters: 
> CRM_meta_interval=[1] ip=[150.158.183.30] 
> Feb  7 13:39:42 serverA crmd: [8696]: info: do_lrm_rsc_op: Performing 
> key=33:13:0:0dff3321-22f5-411c-a50a-e95fcfa4dd6f op=ip_resource_stop_0 )
> Feb  7 13:39:42 serverA lrmd: [8693]: info: rsc:ip_resource:14: stop
> Feb  7 13:39:42 serverA crmd: [8696]: info: process_lrm_event: LRM operation 
> ip_resource_monitor_1 (call=7, status=1, cib-update=0, confirmed=true) 
> Cancelled
> Feb  7 13:40:02 serverA lrmd: [8693]: WARN: ip_resource:stop process (PID 
> 10541) timed out (try 1).  Killing with signal SIGTERM (15).

The stop action times out; you should check why. Note that
ifdown ... is not what IPaddr uses; it calls ifconfig down. You can
also test the resource with ocf-tester, outside of the cluster.
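
For example, something along these lines (resource name and parameters taken
from the log above):

ocf-tester -n ip_resource \
-o ip=150.158.183.30 -o cidr_netmask=22 \
/usr/lib/ocf/resource.d/heartbeat/IPaddr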

Thanks,

Dejan

> Feb  7 13:40:02 serverA lrmd: [8693]: WARN: operation stop[14] on 
> ocf::IPaddr::ip_resource for client 8696, its parameters: ip=[150.158.183.30] 
> cidr_netmask=[22] CRM_meta_timeout=[2] 
> Feb  7 13:40:02 serverA lrmd: [8693]: info: record_op_completion: cannot 
> record operation stop[14] on ocf::IPaddr::ip_resource for client 8696: the 
> client is gone
> Feb  7 13:40:02 serverA lrmd: [8693]: WARN: notify_client: client for the 
> operation operation stop[14] on ocf::IPaddr::ip_resource for client 8696, its 
> parameters: ip=[150.158.183.30] 
> 
> 



Re: [Pacemaker] Corosync & IPAddr problems(?)

2011-02-07 Thread Shravan Mishra
Try using IPaddr2 instead.

-Shravan
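
A minimal crm snippet illustrating the suggestion (ip and cidr_netmask are taken
from the log; the operation timings are assumptions):

primitive ip_resource ocf:heartbeat:IPaddr2 \
params ip="150.158.183.30" cidr_netmask="22" \
op monitor interval="10s" timeout="20s" \
op stop interval="0" timeout="20s"

IPaddr2 manages the address with the ip command rather than ifconfig aliases,
which may behave differently when the address refuses to go away.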



[Pacemaker] Corosync & IPAddr problems(?)

2011-02-07 Thread Stephan-Frank Henry
Hello again,

I am having some possible problems with Corosync and IPaddr.
To be more specific, when I do a /etc/init.d/corosync stop, everything shuts
down more or less gracefully, but the virtual IP is never released (it is still
visible with ifconfig).

If I do a 'sudo ifdown --force eth0:0' it works, so there should be no direct
reason for this.
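
A quick way to confirm the leftover alias after the stop (assuming the address
was added as an eth0:0 alias, as above):

/etc/init.d/corosync stop
ifconfig eth0:0          # the alias should disappear once the IPaddr stop succeeds
ip addr show dev eth0    # same check with iproute2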

This might not be a problem by itself, but I fear it could also be related to
corosync's 'split-brain' handling after a network cable disconnect.
Though that might be something else; I'd rather remove all other problems and
then see if it fixes itself.

I have checked syslog, but nothing really jumps out.
Are there any other logs or places where I can look?

thanks everyone!

Frank

(pls scream if more or other info is needed)

-

OS: Debian Lenny 64bit, kernel version: 2.6.33.3
Corosync: 1.2.1-1~bpo50+1
cluster-glue: 1.0.6-1~bpo50+1
libheartbeat2: 1:3.0.3-2~bpo50+1

relevant cib.xml entry:
[IPaddr primitive definition stripped by the list archive]

Here is a reduced log (only the IP-related entries):
Feb  7 13:39:40 serverA pengine: [8695]: notice: unpack_rsc_op: Operation 
ip_resource_monitor_0 found resource ip_resource active on serverA
Feb  7 13:39:40 serverA pengine: [8695]: notice: native_print:  
ip_resource#011(ocf::heartbeat:IPaddr):#011Started serverA
Feb  7 13:39:40 serverA pengine: [8695]: info: native_merge_weights: ms_drbd0: 
Rolling back scores from ip_resource
Feb  7 13:39:40 serverA pengine: [8695]: info: native_merge_weights: ms_drbd0: 
Rolling back scores from ip_resource
Feb  7 13:39:40 serverA pengine: [8695]: info: native_merge_weights: 
ip_resource: Rolling back scores from fs0
Feb  7 13:39:40 serverA pengine: [8695]: info: native_color: Resource 
ip_resource cannot run anywhere
Feb  7 13:39:40 serverA pengine: [8695]: notice: LogActions: Stop resource 
ip_resource#011(serverA)
Feb  7 13:39:40 serverA crmd: [8696]: info: do_state_transition: State 
transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=handle_response ]
Feb  7 13:39:42 serverA crmd: [8696]: info: te_rsc_command: Initiating action 
33: stop ip_resource_stop_0 on serverA (local)
Feb  7 13:39:42 serverA lrmd: [8693]: info: cancel_op: operation monitor[7] on 
ocf::IPaddr::ip_resource for client 8696, its parameters: 
CRM_meta_interval=[1] ip=[150.158.183.30] 
Feb  7 13:39:42 serverA crmd: [8696]: info: do_lrm_rsc_op: Performing 
key=33:13:0:0dff3321-22f5-411c-a50a-e95fcfa4dd6f op=ip_resource_stop_0 )
Feb  7 13:39:42 serverA lrmd: [8693]: info: rsc:ip_resource:14: stop
Feb  7 13:39:42 serverA crmd: [8696]: info: process_lrm_event: LRM operation 
ip_resource_monitor_1 (call=7, status=1, cib-update=0, confirmed=true) 
Cancelled
Feb  7 13:40:02 serverA lrmd: [8693]: WARN: ip_resource:stop process (PID 
10541) timed out (try 1).  Killing with signal SIGTERM (15).
Feb  7 13:40:02 serverA lrmd: [8693]: WARN: operation stop[14] on 
ocf::IPaddr::ip_resource for client 8696, its parameters: ip=[150.158.183.30] 
cidr_netmask=[22] CRM_meta_timeout=[2] 
Feb  7 13:40:02 serverA lrmd: [8693]: info: record_op_completion: cannot record 
operation stop[14] on ocf::IPaddr::ip_resource for client 8696: the client is 
gone
Feb  7 13:40:02 serverA lrmd: [8693]: WARN: notify_client: client for the 
operation operation stop[14] on ocf::IPaddr::ip_resource for client 8696, its 
parameters: ip=[150.158.183.30] 

