Re: [Linux-HA] pacemaker questions of 4 cases

2010-01-19 Thread Andrew Beekhof
On Fri, Jan 15, 2010 at 7:23 AM, 梁景明  wrote:
> hi, there are 4 cases in my application where I want to use pacemaker:
> case 1: one tomcat goes down unexpectedly; restart it via pacemaker.
> case 2: a machine serving tomcat goes down unexpectedly; fail over to another
> machine, and fail back once it recovers.
> case 3: some tomcats run only on certain special nodes; the other nodes can't
> monitor them.
> case 4: one server application runs only after certain other applications have
> started, i.e. they must start in order.
>
> First I tried to test case 1.
> I built a 4-node environment and put three nodes in standby, like this:
>
> ============
> Last updated: Fri Jan 15 11:57:49 2010
> Stack: openais
> Current DC: bak1 - partition with quorum
> Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
> 4 Nodes configured, 4 expected votes
> 1 Resources configured.
> ============
>
> Node bak1: standby
> Node test1: standby
> Node test2: standby
> Online: [ ubuntu ]
>
> For the tomcat LSB script I used the example from the doc on the wiki. Started
> manually on node ubuntu with "sudo sh /etc/init.d/tomcatpace start" it works
> with no problem.
> crm configure:
>
> node bak1 \
>    attributes standby="on"
> node test1 \
>    attributes standby="on"
> node test2 \
>    attributes standby="on"
> node ubuntu
> primitive tomcat lsb:tomcatpace \
>    op monitor interval="10" timeout="30s" \
>    meta migration-threshold="10" target-role="Started"
>
> My first thought was that since only ubuntu is online, the script would run
> only on ubuntu. Is that right?
> But it fails; it seems that all the nodes are running the script.
>
> Node bak1: standby
> Node test1: standby
> Node test2: standby
> Online: [ ubuntu ]
>
> tomcat    (lsb:tomcatpace) Started [    bak1    test1    test2 ]
>
> Failed actions:
>    tomcat_monitor_0 (node=bak1, call=2, rc=254, status=complete):
>    tomcat_stop_0 (node=bak1, call=3, rc=254, status=complete):
>    tomcat_monitor_0 (node=test1, call=2, rc=254, status=complete):
>    tomcat_stop_0 (node=test1, call=3, rc=254, status=complete):
>    tomcat_monitor_0 (node=test2, call=2, rc=254, status=complete):
>    tomcat_stop_0 (node=test2, call=3, rc=254, status=complete):

Have a look at:
   http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active

In your case, the failed actions indicate the script is not LSB compliant:
   
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ap-lsb.html

First thing to do before trying anything else is to fix the script.
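
For reference, a quick way to check the script against the exit codes
Pacemaker expects is to run the sequence below as root on the node. This is a
sketch of the checks described in the appendix linked above; "tomcatpace" is
the script name from your post.

   # start/stop must succeed and be safe to repeat; status must use LSB codes
   /etc/init.d/tomcatpace start;  echo "start: $?"        # expect 0
   /etc/init.d/tomcatpace status; echo "status: $?"       # expect 0 (running)
   /etc/init.d/tomcatpace start;  echo "start again: $?"  # expect 0, no error
   /etc/init.d/tomcatpace stop;   echo "stop: $?"         # expect 0
   /etc/init.d/tomcatpace status; echo "status: $?"       # expect 3 (stopped)
   /etc/init.d/tomcatpace stop;   echo "stop again: $?"   # expect 0, no error

If any step prints a different code, that is the part of the script to fix
before handing it to the cluster.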

>
> Then I added a location rule, but I am not sure about its usage, so I
> followed the example.
>
> location prefer-ubuntu tomcat \
>        rule $id="prefer-rule" 100: #uname eq ubuntu
>
> Is this the line that prefers node ubuntu, and makes the resource run only on
> that node? Current configuration:
>
> node bak1 \
>        attributes standby="on"
> node test1 \
>        attributes standby="on"
> node test2 \
>        attributes standby="on"
> node ubuntu
> primitive tomcat lsb:tomcatpace \
>        op monitor interval="10" timeout="30s" \
>        meta migration-threshold="10" target-role="Started"
> location prefer-ubuntu tomcat \
>        rule $id="prefer-rule" 100: #uname eq ubuntu
>
> But it fails again:
>
> Node bak1: standby
> Node test1: standby
> Node test2: standby
> Online: [ ubuntu ]
>
> tomcat  (lsb:tomcatpace) Started [      bak1    test1   test2 ]
>
> Failed actions:
>    tomcat_monitor_0 (node=bak1, call=2, rc=254, status=complete):
>    tomcat_stop_0 (node=bak1, call=3, rc=254, status=complete):
>    tomcat_monitor_0 (node=test1, call=2, rc=254, status=complete):
>    tomcat_stop_0 (node=test1, call=3, rc=254, status=complete):
>    tomcat_monitor_0 (node=test2, call=2, rc=254, status=complete):
>
> Thanks for any help.
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Colocation of 2 resources so that it can't run together

2010-01-19 Thread Andrew Beekhof
On Fri, Jan 15, 2010 at 8:39 PM, jaspal singla  wrote:
> Hello,
>
> Thanks for your prompt response. I also have some doubts; it would be great
> if these could be cleared up as well.
>
> Please find my inline queries:
>
>
>> > group group_vz_1 vip_ipaddr2 filesystem1_Filesystem vz1_script \
>> >         meta target-role="started"
>> > group group_vz_2 vip2_ipaddr2 filesystem2_Filesystem vz2_script \
>> >         meta target-role="started"
>> > location location_master group_vz_1 700: node_master
>> > location location_node3 group_vz_2 600: node3
>> > location location_slave_1 group_vz_1 0: node_slave
>> > location location_slave_2 group_vz_2 0: node_slave
>> > colocation colocation_vz_test -inf: group_vz_1 group_vz_2
>>
>> The anti-collocation rule you have is correct, and this should result in
>> the resources not being placed on the same node.
>>
>
> Yes, this configuration is now working as required after a restart of the
> node_slave node, but I don't know why the behavior only became correct after
> that restart.

Did you attach a hb_report from before you rebooted node_slave?

>> Disabling stonith is not a good idea if you're running shared storage.
>>
>>
> For stonith integration, unfortunately I am using old hardware servers that
> have no iLO or similar management ports, and my management does not want to
> invest money in an APC power switch either.
>
> Please suggest: can I go with an SSH-based stonith mechanism for my production
> setup, or is it better to leave my production cluster without stonith?

Neither is an option I would consider appropriate for a production cluster.

>
>> rsc_defaults $id="rsc_defaults-options"
>>
>> You may want to enable resource-stickiness to avoid resources shuffling
>> around needlessly.
>>
>>
> What is the use of a default resource-stickiness value if we have already
> defined a resource-stickiness value on all primitive resources? I have
> already defined resource-stickiness on all of my primitive resources.

If you have already defined it for every resource, then there is no
need to supply a default as well.
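
For reference, a cluster-wide default would be set like this from the crm
shell (the value 100 is only illustrative; a per-resource resource-stickiness
meta attribute still overrides the default, which is why defining both is
redundant):

   # default stickiness for every resource that does not set its own
   crm configure rsc_defaults resource-stickiness="100"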
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] [Openais] Problem with cluster linux HA

2010-01-19 Thread Andrew Beekhof
On Mon, Jan 18, 2010 at 2:46 PM, Galera, Daniel  wrote:
> Hello all,
>
> I have 2 SUSE Linux Enterprise 11 servers with the High Availability
> Extension. I'm configuring a cluster with 2 nodes and only 1 group to run,
> and I use SBD as STONITH. I set the cluster up correctly without problems.
> Now I want to cluster an application named HPOS. For that I need the
> following in the group:
>   SFEX --> to lock the drive
>   LVM --> to activate the VG
>   Filesystem --> to mount the 3 filesystems needed
>   IP --> to bring the cluster IP online
> and then two LSB resources to run the 2 processes of the HPOS application.
> Anyway, the application is not the problem. The problem is that when I test
> the cluster and, for example, MOVE the resource to the other node (server1),
> the group goes down and server2 appears as offline and UNCLEAN according to
> STONITH.

Usually this happens when a resource fails to stop.
Please use hb_report to generate a tarball, and indicate which node you
tried to move the resource to (and how).
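
For reference, a minimal sketch of both steps; the group name "grp_hpos", the
timestamp and the output path are placeholders, not values taken from your
configuration:

   # collect logs and the CIB from all nodes, starting shortly before the
   # failed move, then attach the resulting tarball to your reply
   hb_report -f "2010-01-18 13:00" /tmp/hpos-report

   # how the move is typically requested from the crm shell ...
   crm resource move grp_hpos server1
   # ... and the command that removes the constraint "move" leaves behind
   crm resource unmove grp_hpos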

> That is the information when checking from server1; if at that moment I check
> crm_mon from server2, I see server2 as online but server1 as down. I have no
> idea what the problem is.
>
> Attached is the cluster XML config file.
>
> Attached are the log files of the 2 nodes from when I executed the MOVE
> RESOURCE that failed.
>
> Am I missing a resource location constraint or anything else that is expected?
>
> Do you have an example cluster configuration so I can configure mine correctly?
>
> regards
>
> Dani
>
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] messages from existing heartbeat on the same LAN

2010-01-19 Thread Andrew Beekhof
On Tue, Jan 19, 2010 at 1:15 PM, Dominik Klein  wrote:
> Aclhk Aclhk wrote:
>> On the same LAN, there are already two heartbeat nodes, 136pri and 137sec.
>>
>> I set up another 2 nodes with heartbeat. They keep receiving the following
>> messages:
>>
>> heartbeat[9931]: 2010/01/19_10:53:01 WARN: string2msg_ll: node [136pri] 
>> failed authentication
>> heartbeat[9931]: 2010/01/19_10:53:02 WARN: Invalid authentication type [1] 
>> in message!
>> heartbeat[9931]: 2010/01/19_10:53:02 WARN: string2msg_ll: node [137sec] 
>> failed authentication
>> heartbeat[9931]: 2010/01/19_10:53:02 WARN: Invalid authentication type [1] 
>> in message!
>>
>> ha.cf
>> debugfile /var/log/ha-debug
>> logfile /var/log/ha-log
>> logfacility local0
>> bcast eth0
>> keepalive 5
>> warntime 10
>> deadtime 120
>> initdead 120
>> auto_failback off
>> node 140openfiler1
>> node 141openfiler2
>>
>> The bcast interface is the same on all nodes, namely eth0.
>>
>> Please advise how to avoid these messages.
>
> Use mcast or ucast instead of bcast?

Or change the port.
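
For reference, a sketch of what these suggestions could look like in ha.cf on
the new pair of nodes. The multicast group, peer address and port number are
illustrative values, not taken from the posts:

   # option 1: keep broadcast but move this cluster to its own UDP port
   udpport 695
   bcast eth0

   # option 2: multicast on a group/port the other cluster does not use
   # (syntax: mcast dev mcast-group udp-port ttl loop)
   mcast eth0 239.0.0.1 694 1 0

   # option 3: unicast directly to the other node's address
   ucast eth0 192.168.1.141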
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] messages from existing heartbeat on the same LAN

2010-01-19 Thread Dominik Klein
Aclhk Aclhk wrote:
> On the same LAN, there are already two heartbeat nodes, 136pri and 137sec.
> 
> I set up another 2 nodes with heartbeat. They keep receiving the following
> messages:
> 
> heartbeat[9931]: 2010/01/19_10:53:01 WARN: string2msg_ll: node [136pri] 
> failed authentication
> heartbeat[9931]: 2010/01/19_10:53:02 WARN: Invalid authentication type [1] in 
> message!
> heartbeat[9931]: 2010/01/19_10:53:02 WARN: string2msg_ll: node [137sec] 
> failed authentication
> heartbeat[9931]: 2010/01/19_10:53:02 WARN: Invalid authentication type [1] in 
> message!
> 
> ha.cf
> debugfile /var/log/ha-debug
> logfile /var/log/ha-log
> logfacility local0
> bcast eth0
> keepalive 5
> warntime 10
> deadtime 120
> initdead 120
> auto_failback off
> node 140openfiler1
> node 141openfiler2
> 
> The bcast interface is the same on all nodes, namely eth0.
> 
> Please advise how to avoid these messages.

Use mcast or ucast instead of bcast?
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] messages from existing heartbeat on the same LAN

2010-01-19 Thread Aclhk Aclhk
On the same LAN, there are already two heartbeat nodes, 136pri and 137sec.

I set up another 2 nodes with heartbeat. They keep receiving the following
messages:

heartbeat[9931]: 2010/01/19_10:53:01 WARN: string2msg_ll: node [136pri] failed 
authentication
heartbeat[9931]: 2010/01/19_10:53:02 WARN: Invalid authentication type [1] in 
message!
heartbeat[9931]: 2010/01/19_10:53:02 WARN: string2msg_ll: node [137sec] failed 
authentication
heartbeat[9931]: 2010/01/19_10:53:02 WARN: Invalid authentication type [1] in 
message!

ha.cf
debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility local0
bcast eth0
keepalive 5
warntime 10
deadtime 120
initdead 120
auto_failback off
node 140openfiler1
node 141openfiler2

The bcast interface is the same on all nodes, namely eth0.

Please advise how to avoid these messages.




___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems