Re: [Linux-HA] always have to cleanup LSB script on failover

2008-04-25 Thread Dominik Klein

[EMAIL PROTECTED] wrote:

Hello list,

I have an ordered and collocated group that consists of the following 
elements that startup in order:

Resource Group: GROUPS_KNWORKS_mail
   drbddisk_knworks_mail   (heartbeat:drbddisk):   Started asknmapr01
   drbddisk_axigenbin  (heartbeat:drbddisk):   Started asknmapr01
   ip_knworks_mail (heartbeat::ocf:IPaddr2):   Started asknmapr01
   fs_knworks_mail (heartbeat::ocf:Filesystem):Started asknmapr01
   fs_axigen_bin   (heartbeat::ocf:Filesystem):Started asknmapr01
   ip_knworks_mail_external(heartbeat::ocf:IPaddr2):   Started asknmapr01

   axigen_initscript   (lsb:axigen):   Started asknmapr01
   axigenfilters_initscript(lsb:axigenfilters):Started asknmapr01

Whenever I fail over/migrate the group between nodes, everything works 
just as expected; however, the two bottom LSB scripts never start. They 
just stay in stopped mode until I run the following commands:


for x in asknmapr01 asknmapr02; do crm_resource -C -r 
axigenfilters_initscript -H $x; done
for x in asknmapr01 asknmapr02; do crm_resource -C -r axigen_initscript 
-H $x; done


These two commands run cleanup for both scripts, on both nodes. After I 
run them, the scripts start fine without any action from me. Upon 
migration again, the same thing occurs: I have to clean up the scripts 
once again to get them to run.


Any ideas would be very appreciated.

Thanks!


Did you check your script is LSB compliant? A howto is here: 
http://wiki.linux-ha.org/LSBResourceAgent
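
For reference, a rough by-hand check along the lines of that howto, run directly on one node (using the axigen init script from the group above; the exit codes in the comments are the ones the CRM relies on):

/etc/init.d/axigen start  ; echo "start: $?"         # expect 0
/etc/init.d/axigen start  ; echo "start again: $?"   # expect 0 (starting a started service must not fail)
/etc/init.d/axigen status ; echo "status: $?"        # expect 0 while running
/etc/init.d/axigen stop   ; echo "stop: $?"          # expect 0
/etc/init.d/axigen status ; echo "status: $?"        # expect 3 while stopped
/etc/init.d/axigen stop   ; echo "stop again: $?"    # expect 0 (stopping a stopped service must not fail)

If any of these return something else, the CRM draws the wrong conclusions about the resource state, which typically looks exactly like "never starts until I run a cleanup".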


Regards
Dominik


Re: [Linux-HA] Resource reached -INFINITY

2008-04-24 Thread Dominik Klein

Due to a temporary initialization problem, a resource reached a -INFINITY
score on one node.

 


Is there a way to instruct heartbeat to recalculate the score of the
resource on the node without restarting heartbeat?


Clean up the resource:
crm_resource -C -r $res -H $node

And maybe you have to reset the failcount. But that depends on what 
happened and on your configuration.

crm_failcount -D -r $res -U $node
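
For example (resource and node names are placeholders, substitute your own):

res=my_resource
node=my_node
crm_resource -C -r $res -H $node      # clear the failed operation from the status section
crm_failcount -D -r $res -U $node     # delete the failcount attribute, if one exists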

Regards
Dominik


Re: [Linux-HA] Almost done with my HA setup, but something not working

2008-04-22 Thread Dominik Klein

Nick Duda wrote:

(sorry for the long email, but all my configs are here to view)

I posted before about HA with 2 squid servers. It's just about done, but 
I'm stumbling on something. Every time I manually cause something to happen 
in hopes of seeing it fail over, it doesn't. For example, I get crm_mon to 
show everything as I want it, and when I kill squid (and prevent the xml 
from restarting it) it just goes into a failed state...more below. 
Anyone see anything wrong with my configs?


Server #1
Hostname: ha-1
eth0 - lan (192.168.95.1)
eth1 - xover to eth1 on other server

Server #2
Hostname: ha-2
eth0 - lan (192.168.95.2)
eth1 - xover to eth1 on other server

ha.cf on each server:

bcast eth1
mcast eth0 239.0.0.2 694 1 0
node ha-1 ha-2
crm on

Not using haresources because of crm

Here is the output from crm_mon:


Last updated: Mon Apr 21 15:44:53 2008
Current DC: ha-1 (2422b230-22f2-451b-aa95-0b783eccab8d)
2 Nodes configured.
1 Resources configured.


Node: ha-1 (2422b230-22f2-451b-aa95-0b783eccab8d): online
Node: ha-2 (1691d699-2a81-4545-8242-b00862431514): online

Resource Group: squid-cluster
   ip0 (heartbeat::ocf:IPaddr2):   Started ha-1
   squid   (heartbeat::ocf:squid): Started ha-1

If squid stops on the current heartbeat server, ha-1, it will restart 
within 60sec...so the scripting is working. If I stop the squid process 
and rename /etc/init.d/squid to something else, the script won't be 
able to execute the squid start and should fail over to ha-2, but it 
doesn't; instead this appears (on both ha-1 and ha-2):


What exactly do you rename and how? It's likely the cluster is 
behaving sanely and you're just creating a test case you don't understand.


Regards
Dominik



Last updated: Mon Apr 21 15:47:49 2008
Current DC: ha-1 (2422b230-22f2-451b-aa95-0b783eccab8d)
2 Nodes configured.
1 Resources configured.


Node: ha-1 (2422b230-22f2-451b-aa95-0b783eccab8d): online
Node: ha-2 (1691d699-2a81-4545-8242-b00862431514): online

Resource Group: squid-cluster
   ip0 (heartbeat::ocf:IPaddr2):   Started ha-1
   squid   (heartbeat::ocf:squid): Started ha-1 (unmanaged) FAILED

Failed actions:
   squid_stop_0 (node=ha-1, call=74, rc=1): Error



Re: [Linux-HA] Almost done with my HA setup, but something not working

2008-04-22 Thread Dominik Klein

Nick Duda wrote:
I rename the restart script for squid. 


Your OCF Script or your /etc/init.d script?

My current setup (based on 
examples on the web) shows that if squid fails on the currently running 
server it will try to restart itself. If the restart fails it will fail over. 
So basically I am trying to make a test case scenario that if the squid 
startup script in /etc/init.d got deleted 


Ah, your /etc/init.d script.

Okay, look at what your OCF script does when /etc/init.d/squid is 
not there.


---
INIT_SCRIPT=/etc/init.d/squid

case "$1" in
    start)
        ${INIT_SCRIPT} start > /dev/null 2>&1 && exit || exit 1
        ;;

    stop)
        ${INIT_SCRIPT} stop > /dev/null 2>&1 && exit || exit 1
        ;;

    status)
        ${INIT_SCRIPT} status > /dev/null 2>&1 && exit || exit 1
        ;;

    monitor)
        # Check if Ressource is stopped
        ${INIT_SCRIPT} status > /dev/null 2>&1 || exit 7

        # Otherwise check services (XXX: Maybe loosen retry / timeout)
        wget -o /dev/null -O /dev/null -T 1 -t 1 http://localhost:3128/ && exit || exit 1
        ;;

    meta-data)
--

So for the next monitor operation, it will exec
${INIT_SCRIPT} status > /dev/null 2>&1 || exit 7

This will probably return 7. So the cluster thinks your resource is 
stopped. As it was running before (I guess?), the cluster will now try 
to stop and start it.


Stop calls
${INIT_SCRIPT} stop > /dev/null 2>&1 && exit || exit 1

This will return 1. So the stop operation failed.

With stonith, your node would be rebooted now. I don't see a stonith 
device, so the resource goes unmanaged.


I think what you see is intended.
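
If you want this test to end in a failover instead of an unmanaged resource, the stop branch of that wrapper has to report success when squid is already down. A minimal sketch (assuming the process is simply called "squid"; adjust the check to your setup):

stop)
    # A stop on an already-stopped resource must not fail.
    if [ -x "${INIT_SCRIPT}" ]; then
        ${INIT_SCRIPT} stop > /dev/null 2>&1
    fi
    if pgrep -x squid > /dev/null 2>&1; then
        exit 1    # squid is still running, the stop really failed
    fi
    exit 0        # stopped (or never running) - report success
    ;;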

Regards
Dominik

and squid crashed, it should 
fail over to the other box. It's not.








Re: [Linux-HA] Constraint: Two drdb master on the same node?

2008-04-11 Thread Dominik Klein

I have the following problem with a two-node cluster:

I have two DRBD resources. On the node where drbd0 is master, a certain 
resource group with different resources will be activated. On the node 
where drbd1 is master, this will happen with another resource group.


You can get the necessary constraints from the DRBD Howto:
http://wiki.linux-ha.org/DRBD/HowTov2

Now I want the two DRBD resources to always be master on the same 
node. How do I create such a constraint in the cib/crm file?


<rsc_colocation id="master_on_master" to="ms-drbd3" to_role="master"
  from="ms-drbd1" from_role="master" score="infinity"/>


Regards
Dominik


Re: [Linux-HA] New questions relating to: Methods of dealing with network fail(ure/over)

2008-04-11 Thread Dominik Klein

Stallmann, Andreas wrote:

Hi there!

I have set up a two-node heartbeat cluster running
apache and drbd. 


Everything went fine, till we tested a split brain
scenario. In this case, when we detach both network
cables from one host, we get a two-primary situation.

I read in the thread methods of dealing with network failover
that setting up stonith and a quorum-node might be a 
good workaround.


Well... it isn't in our situation, I think.

Let's assume, we have the following scenario:

- The two nodes, having two interfaces each, monitor each 
other via unicast queries over both interfaces.

- We do not have any dedicated cross-over or serial connections,
because the servers reside in buildings a few kilometers apart. 
- We have only the two Linux nodes in our network which are

part of our cluster (well, a few more to be honest, but those
are the two we may fiddle around with). 
- We won't be able to set up a (dedicated) quorum server.

- We do not have a network enabled power socket we might deactivate
for the node which we want to shoot in the head.

Now someone stumbles over the network cables of, let's say, node-b, 
detaching it from the network.


node-b and node-a do not receive any unicast replies from their peer
anymore, but node-a can still ping its ping host, while node-b can't.

node-b should now assume that it's very likely dead.
node-a can't be sure, because it can't reach its peer
but still can reach the rest of the network (or at least its ping
node).

Actually, I'd like to see the following happen:

- If a node is secondary and assumes that it's very likely dead, 
it should not be allowed to take over any resources.

- If a node is primary and isn't sure about its peer, it should
freeze its state at least till its peer is reachable over one
interface.


That's about exactly what the dopd (drbd peer outdater daemon) is for.

Look into 
http://blogs.linbit.com/florian/2007/10/01/an-underrated-cluster-admins-companion-dopd/


Dopd was rather unusable for the last few weeks(/months?), but I read 
it recently received a bunch of fixes and is supposed to work now (according 
to the drbd-user mailing list and Lars Ellenberg, one of the main authors 
of drbd).
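
For the archives, the usual wiring looks roughly like this (paths can differ per distribution, so double-check against the blog entry and the drbd user guide):

# drbd.conf, per resource (or in the common section):
disk {
    fencing resource-only;
}
handlers {
    fence-peer "/usr/lib/heartbeat/drbd-peer-outdater";
}

# ha.cf on both nodes:
respawn hacluster /usr/lib/heartbeat/dopd
apiauth dopd gid=haclient uid=hacluster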


Refer to that mailing-list and the blog entry. I'd be glad if you told 
us how this worked out.


Regards
Dominik


Re: [Linux-HA] Three questions on failcount attribute

2008-04-10 Thread Dominik Klein

[EMAIL PROTECTED] sbin # ./crm_failcount -G -U isdl601 -r caebench.proc
name=fail-count-caebench.proc value=(null)
Error performing operation: The object/attribute does not exist


 Is this intentional?


 At least the normal behaviour.


in that version


Ah right. crm_failcount gives a reasonable answer now - I just did 
cibadmin -Q|grep fail and never found anything with 0 failures :)






From a consistency point of view, creating it with a value of 0 would

make sense. RFE?

b) It would be nice to have crm_mon display the failcounts for all
resources on all nodes, or mark nodes with any failcount > 0 as failed or
online/failed. RFE?
 Yep, would be nice. Request that as an enhancement in the bugzilla.


which is what happens in newer version :)


I agree one can see a failed operation. But the failcount in crm_mon?


Re: [Linux-HA] how to config to the cluster to perform a simple two node (master-slave) cluster?

2008-04-10 Thread Dominik Klein
Hi

to run all resources on the same node, you could put them in a group.
Read http://wiki.linux-ha.org/ClusterInformationBase/ResourceGroups

If you want to decide which node the group is usually located at, you
need a rsc_location constraint. An example is also on that page.

To move the group to another node after a resource failure, you have to
look into resource_failure_stickiness. Read
http://wiki.linux-ha.org/ScoreCalculation.
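
Roughly, the pieces look like this (ids, names and values are only placeholders to illustrate the structure):

<group id="my_group">
  <primitive id="my_ip" class="ocf" provider="heartbeat" type="IPaddr2">
    <instance_attributes id="my_ip_ia">
      <attributes>
        <nvpair id="my_ip_addr" name="ip" value="192.168.0.10"/>
      </attributes>
    </instance_attributes>
  </primitive>
  <primitive id="my_app" class="lsb" type="myapp"/>
</group>

<rsc_location id="prefer_node1" rsc="my_group">
  <rule id="prefer_node1_rule" score="100">
    <expression id="prefer_node1_expr" attribute="#uname" operation="eq" value="node1"/>
  </rule>
</rsc_location>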

Regards
Dominik

[EMAIL PROTECTED] wrote:
 hi,
 I want to construct a two-node cluster.
 1. Firstly, all of the resources are started up on the DC node.
 2. When an error occurs on some critical resource, all of the 
resources are moved to the other node.
 3. All resources must run on only one node for their whole lifetime,
until an error occurs and they are moved to the other node.
 
 I know some attributes (like resource stickiness) can locate the resources
 in one place.
 
 Any more advice?
 
 Thank you


Re: [Linux-HA] Ordering question

2008-04-09 Thread Dominik Klein

William Francis wrote:

http://linux-ha.org/DRBD/HowTov2

it has the example

<rsc_order id="drbd0_before_fs0" from="fs0" action="start"
  to="ms-drbd0" to_action="promote"/>


This reads: start fs0 after ms-drbd0 promote


which seems to mean promote ms-drbd0 (the to) THEN start fs0 (the from)

But if you look at this page

http://www.linux-ha.org/ClusterInformationBase/ResourceGroups

it has, for example

<rsc_order id="database_before_apache" from="WebServerDatabase"
  action="start" type="before" to="WebServerApache" symmetrical="TRUE"/>


Notice type="before", which is not the default.
So this reads: start WebServerDatabase before WebServerApache.

You can set type to "after" or "before"; "after" is the default.
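
Side by side (both forms taken from the pages above):

<!-- default type="after": start fs0 only after ms-drbd0 has been promoted -->
<rsc_order id="drbd0_before_fs0" from="fs0" action="start" to="ms-drbd0" to_action="promote"/>

<!-- explicit type="before": start WebServerDatabase before WebServerApache -->
<rsc_order id="database_before_apache" from="WebServerDatabase" action="start" type="before" to="WebServerApache" symmetrical="TRUE"/>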

Regards
Dominik


Re: [Linux-HA] Three questions on failcount attribute

2008-04-09 Thread Dominik Klein

Martin Knoblauch wrote:

Hi,

 three questions on the failcount attribute. I am running 2.0.8, and yes I 
know I should upgrade ... :-(


Good to know you know :)


a) Is it possible that the failcount for a resource/node is only available 
after a failure? On a not-yet-failed resource I see:


[EMAIL PROTECTED] sbin # ./crm_failcount -G -U isdl601 -r caebench.proc
name=fail-count-caebench.proc value=(null)
Error performing operation: The object/attribute does not exist


 Is this intentional? 


At least the normal behaviour.


From a consistency point of view, creating it with a value of 0 would make 
sense. RFE?

b) It would be nice to have crm_mon display the failcounts for all resources on all nodes, or mark 
nodes with any failcount > 0 as failed or online/failed. RFE?


Yep, would be nice. Request that as an enhancement in the bugzilla.


c) Something I likely will be shot for :-) If I set the failcount of a resource/node to 
e.g. -5, does this mean that the resource can fail 6 times before the node 
is no longer eligible to run the resource?


No.

Something like this can be achieved with resource_failure_stickiness. 
Read http://wiki.linux-ha.org/ScoreCalculation
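
A sketch of what that looks like per resource (values are only examples; using the score math from that wiki page and ignoring other location scores, a stickiness of 100 with a failure stickiness of -30 would move the resource away after its fourth failure, since 100 - 4*30 < 0):

<primitive id="caebench.proc" ...>
  <meta_attributes id="caebench.proc_ma">
    <attributes>
      <nvpair id="caebench.proc_rs"  name="resource_stickiness" value="100"/>
      <nvpair id="caebench.proc_rfs" name="resource_failure_stickiness" value="-30"/>
    </attributes>
  </meta_attributes>
</primitive>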


Regards
Dominik


Re: [Linux-HA] crm_failcount queries quite slow?

2008-04-04 Thread Dominik Klein

Lars Marowsky-Bree wrote:

On 2008-04-03T13:59:36, Dejan Muhamedagic [EMAIL PROTECTED] wrote:


Any crm* program is significantly slower on a non-DC node
regardless of whether something's happening in the cluster. It's
always been like that.


I can confirm that. It's been like that for me ever since I started using heartbeat.


Hm, I've not personally observed that in my test cluster, or at least
not noticed anything out of line.

Significantly slower is bad; we mandate that DC or not DC is _not_
the question, and that users shouldn't care about this designation.

Could anyone who reproduces this report a few more details? Is it the
local node, the time it takes to process on the DC, or the network
roundtrip? (Should be observable using tcpdump/wireshark)


Just 2 measurements:

dktest2sles10:~# time crmadmin -D
Designated Controller is: dktest2sles10

real0m0.005s
user0m0.004s
sys 0m0.000s

dktest1sles10:~/cib# time crmadmin -D
Designated Controller is: dktest2sles10

real0m1.014s
user0m0.000s
sys 0m0.004s

dktest2sles10:~# time cibadmin -Q > /dev/null

real0m0.009s
user0m0.004s
sys 0m0.004s

dktest1sles10:~/cib# time cibadmin -Q > /dev/null

real0m1.713s
user0m0.004s
sys 0m0.004s

tcpdump:

y.x.z.103 is the DC
y.x.z.102 is the other node

08:22:16.803702 IP 10.200.200.102.32952 > 10.200.200.103.694: UDP, 
length 217
08:22:16.803626 IP 10.250.250.102.32951 > 10.250.250.103.694: UDP, 
length 221
08:22:16.803637 IP 10.250.250.102.32951 > 10.250.250.103.694: UDP, 
length 217
08:22:16.929482 IP 10.250.250.103.32869 > 10.250.250.102.694: UDP, 
length 221
08:22:16.929528 IP 10.200.200.103.32870 > 10.200.200.102.694: UDP, 
length 221


up to here, it's been just the normal heartbeat packets I think. Notice 
the roughly identical length.


Then I do:

debian dktest1sles10:~/cib# date +%H:%M:%S:%N; time cibadmin -Q > /dev/null
08:22:16:04482

real0m1.189s
user0m0.008s
sys 0m0.00

08:22:16.929976 IP 10.250.250.103.32869 > 10.250.250.102.694: UDP, 
length 2263
08:22:16.930026 IP 10.200.200.103.32870 > 10.200.200.102.694: UDP, 
length 2263

08:22:16.930029 IP 10.200.200.103 > 10.200.200.102: udp
08:22:16.929979 IP 10.250.250.103 > 10.250.250.102: udp

Both servers received an ntpdate sync against the same timesource a 
minute earlier. So to me, it looks like it's the DC who needs some time 
to process the request. The cluster had one primitive resource at that 
time and should have been pretty much idle.


Regards
Dominik


Re: [Linux-HA] lsb resource problem

2008-04-03 Thread Dominik Klein

Hi

William Francis wrote:

Ubuntu 7.10 with DRBD 8.0.3 and Heartbeat 2.1.2 with an updated Filesystem file

kernel 2.6.22-14 (updated from stock)

I have possibly two problems, a heartbeat and a DRBD issue. My goal is
to get a pair of machines working with a large /opt partition for
zimbra (my mail server software) and a virtual IP.


1.  I can configure heartbeat and DRBD with a virtual IP with no
problems at all. I can start and stop heartbeat on the two machines
and because of the colocations I have set up the resources move around
properly with no problems. If I start zimbra manually on the machine
that currently has the /opt partition mounted and the virtual IP it
works with no problem (I installed it with no issues).

I then add Zimbra, a lsb resource, like so:

<primitive id="zimbra" class="lsb" type="zimbra"/>


You did check that this script is LSB compliant? If not, see 
http://wiki.linux-ha.org/LSBResourceAgent and change the script if 
necessary.



in crm_mon, I can see it start the zimbra resource (on the machine
with the other resources). However, after several seconds it reports a
failure and I see something like this in crm_mon
Master/Slave Set: ms-drbd0
drbd0:0 (heartbeat::ocf:drbd):  Master d243
drbd0:1 (heartbeat::ocf:drbd):  Started d242
fs0 (heartbeat::ocf:Filesystem):Started d243
ip_resource (heartbeat::ocf:IPaddr):Started d243
zimbra  (lsb:zimbra):   Started d243 (unmanaged) FAILED

Failed actions:
zimbra_start_0 (node=d243, call=7, rc=1): Error
zimbra_stop_0 (node=d243, call=8, rc=1): Error


Well, as you can see, the start operation failed. Therefore, the 
resource is stopped afterwards (notice the larger call number). But 
the stop operation also failed. So as the cluster cannot say what status 
this resource is in, it will not be touched anymore.


Actually, if you had stonith configured, your node would be rebooted 
now, but that's another topic.




It should be noted that zimbra takes a long time to start and stop,
maybe as long as two minutes 


Then you should set an appropriate value for timeout in the start 
operation. Something like this:


<primitive ...>
  <operations>
    <op id="zimbra-start-op" name="start" timeout="120s"/>
  </operations>
</primitive>


since it launches many sub processes. If
there is a way to take that into account, I don't know where to do it.



Also, I have made rsc_order and rsc_colocation constraints but I have
the same results as here. If I start zimbra by its init.d script and
then 'echo $?', it returns 0 and starts properly.


That's because the default timeout is 20s (or something in that range at 
least) and that seems to not be enough for you.



What I don't get is that it looks like it's trying to start zimbra
before DRBD is active even though I have a rsc_order set not to do so.
The constraints are below and I've included a small part of the logs
at the bottom. It seems to fail because it can't write out to a file
on /opt, which it can't do because it's not mounted.


2. Let's say I restart heartbeat on the other machine. DRBD does not
seem to reconnect properly and I get stuck with them in
WFReportParams/WFBitMapT and I have yet to find a way outside of
rebooting one machine to fix this. this only happens when I have
zimbra as a resource and when nothing is really using /opt I can
switch back and forth with no problems. I've seen some reports that
this might be DRBD/kernel version problem but it seem like most of
those were under DRBD 7.


For master/slave resources, you should use a newer version of heartbeat 
and especially of the crm (which is now called pacemaker and needs to be 
installed separately). To install newer version, please read 
http://www.clusterlabs.org/mw/Install and use this pacemaker version: 
http://hg.clusterlabs.org/pacemaker/stable-0.6/archive/tip.tar.gz



I have removed all files in the rc*.d directories for drbd and zimbra.
much of this was taken directly from faqs and howtos

I will happily provide logs or other debugging info. configs to follow



[EMAIL PROTECTED]:/root/tmp# cat /proc/drbd
version: 8.0.3 (api:86/proto:86)
SVN Revision: 2881 build by [EMAIL PROTECTED], 2008-03-25 00:46:06
 0: cs:WFBitMapT st:Secondary/Primary ds:UpToDate/UpToDate C r---
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
act_log: used:0/257 hits:0 misses:0 starving:0 dirty:0 changed:0


[EMAIL PROTECTED]:/etc/init.d# cat /proc/drbd
version: 8.0.3 (api:86/proto:86)
SVN Revision: 2881 build by [EMAIL PROTECTED], 2008-03-24 16:02:09
 0: cs:WFReportParams st:Primary/Unknown ds:UpToDate/DUnknown C r---
ns:4 nr:42960 dw:43504 dr:45105 al:0 bm:7 lo:2 pe:0 ua:0 ap:1
resync: used:0/31 hits:51 misses:7 starving:0 dirty:0 changed:7
act_log: used:1/257 hits:136 misses:1 starving:0 dirty:0 changed:0



drbd.conf

global {
usage-count yes;
}
common {
  syncer { rate 50M; }
}
resource drbd0 {
  protocol C;
  

Re: [Linux-HA] Re: pingd problem

2008-04-03 Thread Dominik Klein

Achim Stumpf wrote:

Hi,

Now it works. I have changed in cib.xml:

<rsc_location id="group_1:connected" rsc="group_1">
  <rule id="group_1:connected:rule" score_attribute="pingd">
    <expression id="group_1:connected:expr:defined" attribute="pingd" operation="defined"/>
  </rule>
</rsc_location>


to


<rsc_location id="group_1:connected" rsc="group_1">
  <rule id="group_1:connected:rule" score="-INFINITY" boolean_op="or">
    <expression id="group_1:connected:expr:undefined" attribute="pingd" operation="not_defined"/>
    <expression id="group_1:connected:expr:zero" attribute="pingd" operation="lte" value="0"/>
  </rule>
</rsc_location>


Now it works as expected. Those two setups were described on:

http://www.linux-ha.org/pingd

But still it would be nice to get ping nodes working with scores as made 
in my first example or as described on that page in "Quickstart - Run my 
resource on the node with the best connectivity".


Does anyone have any hints how to get that stuff working?


That should work - what did you see (and what did you expect)? You 
probably need to adjust the pingd multiplier. showscores.sh helps a lot 
here to figure out what's not as expected (see 
http://www.linux-ha.org/ScoreCalculation )
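
In case it helps: the pingd attribute itself usually comes either from a pingd clone resource or from the pingd daemon started in ha.cf. A minimal ha.cf variant might look like this (multiplier and dampening values are only examples; with -m 100 each reachable ping node is worth 100 score points):

ping 10.14.0.10
ping 10.14.0.11
respawn hacluster /usr/lib/heartbeat/pingd -m 100 -d 5s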


Regards
Dominik


Re: [Linux-HA] Re: pingd problem

2008-04-03 Thread Dominik Klein
</primitive>
<primitive class="ocf" id="apache2_1" provider="heartbeat" type="apache">
  <operations>
    <op id="apache2_1_mon" interval="60s" name="monitor" timeout="55s"/>
  </operations>
  <instance_attributes id="apache_2_1_inst_attr">
    <attributes>
      <nvpair id="apache_2_1_attr_0" name="configfile" value="/etc/httpd/conf/httpd.conf"/>
      <nvpair id="apache_2_1_attr_1" name="statusurl" value="http://127.0.0.1/server-status/"/>
      <nvpair id="apache_2_1_attr_2" name="options" value="-DSSL"/>
    </attributes>
  </instance_attributes>
</primitive>
<instance_attributes id="group_1_instance_attrs">
  <attributes>
    <nvpair id="group_1_target_role" name="target_role" value="started"/>
    <nvpair id="group_1_resource_stickiness" name="resource_stickiness" value="200"/>
  </attributes>
</instance_attributes>


Apart from the fact that these attributes should be meta_attributes 
instead of instance_attributes, this will give you a score of 4 * 200 
= 800 for the node the group is actually running on.


So with ping working, you should have scores of
800 + 500 for node1
500 for node2

Now you block icmp on node1. You will have:

800 on node1
500 on node2

So why should the cluster move any resource?

Regards
Dominik


</group>
</resources>
<constraints>
  <rsc_location id="rsc_location_group_1" rsc="group_1">
    <rule id="prefered_location_group_1" score="100">
      <expression attribute="#uname" id="prefered_location_group_1_expr" operation="eq" value="sputnik.test"/>
    </rule>
  </rsc_location>
  <rsc_location id="group_1:connected" rsc="group_1">
    <rule id="group_1:connected:rule" score_attribute="pingd">
      <expression id="group_1:connected:expr:defined" attribute="pingd" operation="defined"/>
    </rule>
  </rsc_location>
</constraints>
</configuration>
<status/>
</cib>


It would be nice to actually get this setup working with scores, but it 
does not work, as you see in the logs. There won't be a failover to the 
other node.


Any hints how I could get the setup with scores above working?


Thanks,

Achim






Re: [Linux-HA] Re: pingd problem

2008-04-03 Thread Dominik Klein

Achim Stumpf wrote:
This will give you a pingd score of 500. A ping_group is treated as 
one ping_host score wise.


If you want to take each ping hosts connectivity into play, you should 
have

ping 10.14.0.10
ping 10.14.0.11
ping 10.14.0.12
ping 10.14.0.13
instead. This would give a pingd score of 2000 (and make your setup 
work score-wise). 


I know that ping_group is treated as one ping_host score wise. I expect 
the score to be 500 for that.


<instance_attributes id="group_1_instance_attrs">
  <attributes>
    <nvpair id="group_1_target_role" name="target_role" value="started"/>
    <nvpair id="group_1_resource_stickiness" name="resource_stickiness" value="200"/>
  </attributes>
</instance_attributes>


Apart from the fact that these attributes should be meta_attributes 
instead of instance_attributes, this will give you a score of 4 * 
200 = 800 for the node the group is actually running on.


So with ping working, you should have scores of
800 + 500 for node1
500 for node2

Now you block icmp on node1. You will have:

800 on node1
500 on node2

So why should the cluster move any resource? 


Ah, ok. So it's better to change this to:

<meta_attributes id="group_1_instance_attrs">
  <attributes>
    <nvpair id="group_1_target_role" name="target_role" value="started"/>
    <nvpair id="group_1_resource_stickiness" name="resource_stickiness" value="200"/>
  </attributes>
</meta_attributes>

And the score of 200 counts for every primitive in the group. Ok, so 
it's 800. I thought it counts only one time.


Again, read http://wiki.linux-ha.org/ScoreCalculation

Apart from my misunderstanding here with 4*200 score, does my setup work 
score-wise now? Or do I miss anything?


I don't know the config you use now, but if you address the score 
issue I pointed out, I guess it should. The cib looked ok.


Regards
Dominik


Re: [Linux-HA] cibadmin question

2008-04-02 Thread Dominik Klein

Jason Erickson wrote:

I am trying to add a resource with this command.
cibadmin -C -o resources -x meatware_stonithcloneset.xml

It is telling me "could not parse input file".


here is the xml file as well.


<clone id="meat_stonith_cloneset">

−


This "−" is not actually in there, is it? If it is, get rid of it.
Other than that, it works for me.

Which version are you using?

Regards
Dominik


<instance_attributes id="meat_stonith_cloneset">

−

<attributes>

<nvpair id="meat_stonith_cloneset-01" name="clone_max" value="2"/>

<nvpair id="meat_stonith_cloneset-02" name="clone_node_max" value="1"/>

<nvpair id="meat_stonith_cloneset-03" name="globally_unique" value="false"/>

<nvpair id="meat_stonith_cloneset-04" name="target_role" value="started"/>

</attributes>

</instance_attributes>

−

<primitive id="meat_stonith_clone" class="stonith" type="meatware" provider="heartbeat">

−

<operations>

<op name="monitor" interval="5s" timeout="20s" prereq="nothing" id="meat_stonith_clone-op-01"/>

<op name="start" timeout="20s" prereq="nothing" id="meat_stonith_clone-op-02"/>

</operations>

−

<instance_attributes id="meat_stonith_clone">

−

<attributes>

<nvpair id="meat_stonith_clone-01" name="hostlist" value="lin lor"/>

</attributes>

</instance_attributes>

</primitive>

</clone>


Jason Erickson



Re: [Linux-HA] crm_failcount queries quite slow?

2008-04-02 Thread Dominik Klein

Abraham Iglesias wrote:

Hi all,
I'm trying to implement my snmp mib module to get every resource's 
failcount in the cluster. I'm surprised that the crm_failcount query to 
get the failcount for a resource takes 2-3 seconds. Then, for 8 
resources in the cluster it takes 16-24s, which is quite poor performance.


Is it normal that the failcount query takes so long?


Run it on the DC. Should be way faster there.

Regards
Dominik


Re: [Linux-HA] crm_failcount queries quite slow?

2008-04-02 Thread Dominik Klein

Abraham Iglesias wrote:

thank you for the answer, dominik.

As you said, on the DC it is faster, but I need to run these queries on 
every node, as every node can be asked for that information. :S


crm_failcount -U
;)

Is there any cached data about these values? Or a static file where the 
results are stored?


You could also grep from /var/lib/heartbeat/crm/cib.xml

Regards
Dominik


Re: [Linux-HA] crm_failcount queries quite slow?

2008-04-02 Thread Dominik Klein

Abraham Iglesias wrote:

Hi,
thanks for the failcount tip ;)

By the way, I'm using heartbeat 2.0.8. There is no information in 
cib.xml about failcounts... is that possible? Or am I missing anything?


No. I was wrong. The failcount is in the status section, which is never 
written to disc. Sorry about that.


Regards
Dominik


Re: [Linux-HA] HA maintenance mode

2008-03-31 Thread Dominik Klein
I don't see an option to specify the behavior for the standby mode in 
the manual. I just want to prevent HA from moving resources to other nodes 
for maintenance purposes.


So basically, you want to stop all resources at once, don't you?

Here's a feature request:

http://developerbugs.linux-foundation.org/show_bug.cgi?id=1862
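
Until something like that exists, a workaround sometimes used for maintenance (assuming a 2.x CRM that understands the is_managed_default cluster property) is to tell the CRM to leave resources alone without stopping them:

# resources keep running, but the CRM no longer starts/stops/moves them
crm_attribute -t crm_config -n is_managed_default -v false

# ... do the maintenance ...

crm_attribute -t crm_config -n is_managed_default -v true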

Regards
Dominik


Re: [Linux-HA] adding DRBD group problem

2008-03-28 Thread Dominik Klein

 B) put everything in a group in the master_slave resource?


... never tried this by myself


I don't think this would work without changing all the group's 
resource agents to be master/slave resource agents.


Every resource within the master_slave would be promoted/demoted etc. 
and that is probably not possible out of the box.


Regards
Dominik


Re: [Linux-HA] Re: Re: showscores.sh weirdness and Not failing over after, repeated kills of IPaddr2?

2008-03-27 Thread Dominik Klein

Roland G. McIntosh wrote:

Dominik Klein wrote:
  With a failure stickiness of -30, you allow your groups resources to
  fail (400/30)=14 times. Is that what you want?

Although the default failure stickiness is -30, the group has a failure 
stickiness of -100.  I would like to failover after 3 or 4 failures.

My test with 15 stop commands was just to be sure.


Well, with your new cib, that should work as intended. The cib of your 
first email would not, and what else should I refer to? :)


  You don't have any monitor operations for the ipaddr and jboss
  resources. Failures on them are not detected. Configure monitor
  operations and try again.

I actually do have monitor operations on both, I accidentally sent out 
an old cib.xml, updated file attached.  Previous showscores.sh output is 
correct for this cib.xml. It behaves as described in the last e-mail 
even with monitor operations on IPaddr2 and jboss.


  Also make sure you use a recent version.Otherwise you may also hit the
  bug of not increasing failcount in 2.1.3's crm. This is fixed in
  pacemaker (0.6.x)

Uh oh.  I definitely have 2.1.3 with the crm_failcount bug, but I didn't 
think this would affect score calculation. 


If the failcount is not updated, the score is not changed when a 
resource fails again. Read http://wiki.linux-ha.org/ScoreCalculation


I didn't install a pacemaker 
package, I used the CentOS4 extras RPMs.  I hope CentOS4 / RHEL4 
packages can be released. I could not rebuild RHEL5 packages from the 
openSUSE ha-clustering repository due to this:


configure:3065: gcc -c  -O2 -g  conftest.c 5
conftest.c:2: error: syntax error before me
configure:3071: $? = 1
configure: failed program was:
| #ifndef __cplusplus
|choke me
| #endif

Should I seek an alternative to these CentOS 4 extras RPMs?


Can't give any advice on that. Any RH/CentOS users?


  pps. where did you get the jboss RA? I'd be interested in it.

http://rgm.nu/jbossocf

I hacked it together, it's ugly.  Hope it's useful to you, although it's 
tailored to a very old jboss release 3.0.8, with some customizations to 
support multiple instances of jboss on different ports.  Relies on ps, 
awk, egrep, and curl, only tested on RHEL4.  You'll want to change the 
HTTPCODE check with an URL for your servlet (or modify to use jmx-console).


Thanks. I will have a look at it and see if I can make it work for me :)

Regards
Dominik


Re: [Linux-HA] slave's drbd resource doesn't get promote when master dies

2008-03-27 Thread Dominik Klein

I think I have found out my problem though: I didn't put the resource
location stuff for pingd. I added this snippet to the CIB to constrain
the master-slave drbd resource to not run on a node with lost
connectivity and so far in my tests it seems to work:

<rsc_location id="drbd_id:connected" rsc="ms-drbd_id">
  <rule id="drbd_id:connected:rule" score="-INFINITY" boolean_op="or">
    <expression id="drbd_id:connected:expr:undefined" attribute="pingd" operation="not_defined"/>
    <expression id="drbd_id:connected:expr:zero" attribute="pingd" operation="lte" value="0"/>
  </rule>
</rsc_location>


Slightly OT, but with this config, consider that if all your ping nodes 
are down, your resource will not run anywhere. If that's okay with you, 
stay with that config.


Otherwise you might want to set something like this:

   <rsc_location id="res-connected" rsc="res">
     <rule role="master" id="res-connected-rule" score_attribute="pingd">
       <expression id="res-connected-rule-1" attribute="pingd" operation="defined"/>
     </rule>
   </rsc_location>

This will *add* the pingd attribute's value to the score for the master 
role. If pingd loses connectivity, this will be expressed in a lower 
score (as the pingd attribute value is then 0). If the other node has a 
higher score, the resource will be migrated.


You will have to play around with the pingd multiplier according to your 
number of ping nodes to get those values right. You can use my 
showscores script to see the scores and test: 
http://wiki.linux-ha.org/ScoreCalculation#head-4355c45fc51c60c8e0f8a063bdc4069fdc17f761



and when I put the master in standby the resources are correctly
migrated. Same goes when I power off the master or yank the eth0
network cable. I still have issues with failover as it seems that
'auto_failback off' is not honored correctly.


That is a R1 configuration option. It does nothing in R2 (crm) mode. 
Read http://wiki.linux-ha.org/ScoreCalculation and set an appropriate 
value for resource stickiness. This way you can achieve a behaviour as 
with auto_failback off in R1.


Regards
Dominik


Re: [Linux-HA] Virtual IP

2008-03-27 Thread Dominik Klein

Jason Erickson wrote:
The only part that I am confused about is where you set the resource 
score for a node.


Within the constraints section of the cib.

Something like this:

 <constraints>
   <rsc_location id="rscloc-webserver" rsc="webserver">
     <rule id="rscloc-webserver-rule-1" score="200">
       <expression id="rscloc-webserver-expr-1" attribute="#uname" operation="eq" value="node1"/>
     </rule>
   </rsc_location>
 </constraints>

Regards
Dominik


Re: [Linux-HA] showscores.sh weirdness and Not failing over after repeated kills of IPaddr2?

2008-03-26 Thread Dominik Klein

Hi

Roland G. McIntosh wrote:
No matter how many times I kill IPaddr2 I can't seem to cause a failover 
in my simple 2 node cluster.


OT, but why do people keep calling 2 node clusters simple clusters? 
Clusters are not simple. Maybe it's a rather small cluster.


I'm trying to get it working for the 3 services in my group (HB 2.1.3 on 
RHEL4 using CentOS packages).  I don't understand why showscores.sh 
shows INFINITY for my OCF resources, but an integer value for the 
IPaddr2 resource.


This is expected. In a colocated group, only the first resource receives 
the configured integer stickiness value (times the number of resources 
in that group). Read below.



Here is the output of my showscores.sh:

[EMAIL PROTECTED] rss]$ ./showscores.sh
Resource        Score      Node         Stickiness  #Fail  Fail-Stickiness

slink_db        -INFINITY  slinkfail    100         0      -30
slink_db        INFINITY   slinkmaster  100         0      -30
slink_ipaddr2   0          slinkfail    100         0      -30
slink_ipaddr2   400        slinkmaster  100         0      -30


As you see here, you have a node preference of 100 plus 3 * 100 stickiness.


slink_jboss     -INFINITY  slinkfail    100         0      -30
slink_jboss     INFINITY   slinkmaster  100         0      -30


The INFINITY is implicitly given by the colocated group - that way your 
resources run on the same node. -INFINITY is to make sure they don't run 
on any other node than the one the first resource was started on.


With a failure stickiness of -30, you allow your group's resources to 
fail (400/30)=14 times. Is that what you want?


I'm using the Mar 2008 version of showscores.sh (thanks Dominik!), so 
perhaps this is related to the known issue of meta attributes on the 
group instead of on the primitive.


From your config - no, it's not about that, as you don't have a 
stickiness meta attribute for the group, just default values.



I've been trying to force a failover like this:

export OCF_RESKEY_ip=192.168.1.222
for nn in `seq 1 15`; do
  /usr/lib/ocf/resource.d/heartbeat/IPaddr2 stop
  sleep 1m
done

After one the score becomes 200.
Then it seems to jump back up to 300 and stays there.  It never proceeds 
down below zero as I expect.  I have a colocation constraint, as you can 
see in my cib.xml.


You don't have any monitor operations for the ipaddr and jboss 
resources. Failures on them are not detected. Configure monitor 
operations and try again.


Also make sure you use a recent version. Otherwise you may also hit the 
bug of not increasing failcount in 2.1.3's crm. This is fixed in 
pacemaker (0.6.x)


This line from your config:

   <rsc_colocation id="colocation_MyGroup" from="MyGroup" to="MyGroup" score="INFINITY"/>


is not needed. I don't even know what you want to express with this.

Regards
Dominik

ps. I'll add the group score things to 
http://www.linux-ha.org/ScoreCalculation soon.


pps. where did you get the jboss RA? I'd be interested in it.


Re: [Linux-HA] DRBD problems

2008-03-26 Thread Dominik Klein

Néstor wrote:

I am getting these errors when running the command drbdadm adjust mysql on
WAHOO:
*Failure: (114) Lower device is already claimed. This usually means it is
mounted.


Well, is it?


Command 'drbdsetup /dev/drbd0 disk /dev/sda2 /dev/sda2 internal
--set-defaults --create-device --on-io-error=detach' terminated with exit
code 10*

And on my second node WAHOO2, I get:
*No response from the DRBD driver! Is the module loaded?
Command 'drbdsetup /dev/drbd0 disk /dev/sda2 /dev/sda2 internal
--set-defaults --create-device --on-io-error=detach' terminated with exit
code 20*


Saw this a couple of times, too. Which drbd version are you using?

Regards
Dominik


Re: [Linux-HA] DRBD problems

2008-03-26 Thread Dominik Klein

Néstor wrote:

Version 8.2.5

I think it is telling me that the device is already mounted.


Right. Is it?


What I do not understand then is how to pick a directory or device. Do I
need to re-partition
my device to create a separate device for drbd, or can I pick just a
directory within the device
partition that I want to use?


You can use an existing partition or create a new one if you have space 
left to do so. Drbd cannot be used on files or directories.


Regards
Dominik


Re: [Linux-HA] Help with failure-stickiness

2008-03-22 Thread Dominik Klein

Roland G. McIntosh schrieb:
I've got 3 resources in a group, and I'd like to configure stickiness 
values such that if there are more than 3 failures in the group all 
resources go to the failover node.  I've read 
http://www.linux-ha.org/v2/faq/forced_failover many times, but do not 
quite understand from that how to configure my cluster stickiness 
values 


Did you read
http://www.linux-ha.org/ScoreCalculation?

Regards
Dominik


Re: [Linux-HA] slave's drbd resource doesn't get promote when master dies

2008-03-20 Thread Dominik Klein

Jean-Francois Malouin wrote:

I thought I had it nailed but still no go.

I'm running a simple two-nodes Active/Passive, Debian/Etch cluster
with apache, mysql, heartbeat-2.1.3 and drbd-8.2.5 using mcast on the
primary NIC and bcast on secondary GigE interfaces which is also the
replication link for drbd. I also setup a serial link between the
nodes. I've setup dopd as per the drbd user guide and Florian's blog
and seems to work as documented when I 'ifconfig down eth1' or do
other nasty things with the x-over cable.

I can migrate the drbd master manually back and forth between the
nodes (feeble-0 and feeble-1), but if I crm_standby the master or
shutdown/pull the plug on the primary, then the secondary doesn't get
promoted, the drbd gets split-brained and I must then manually
untangle the mess. My cib is obviously not correct but my brain is
having a hard time parsing the xml... any pointers please?


The xml looks good to me.


Log show after attempting a crm_standby:

pengine[5003]: 2008/03/19_16:55:58 info: unpack_nodes: Node feeble-1 is in 
standby-mode
pengine[5003]: 2008/03/19_16:55:58 info: determine_online_status: Node feeble-1 
is standby
pengine[5003]: 2008/03/19_16:55:58 info: determine_online_status: Node feeble-0 
is online
pengine[5003]: 2008/03/19_16:55:58 WARN: unpack_rsc_op: Processing failed op 
drbd_id:0_promote_0 on feeble-0: Error


Find out why this failed.


pengine[5003]: 2008/03/19_16:55:58 notice: clone_print: Master/Slave Set: 
ms-drbd_id
pengine[5003]: 2008/03/19_16:55:58 notice: native_print: drbd_id:0 
(heartbeat::ocf:drbd):  Master feeble-0 FAILED
pengine[5003]: 2008/03/19_16:55:58 notice: native_print: drbd_id:1 (heartbeat::ocf:drbd):  Stopped 
pengine[5003]: 2008/03/19_16:55:58 notice: native_print: fs_id (heartbeat::ocf:Filesystem):Stopped 
pengine[5003]: 2008/03/19_16:55:58 notice: native_print: ip_id (heartbeat::ocf:IPaddr):Stopped 
pengine[5003]: 2008/03/19_16:55:58 notice: native_print: mysql_id (heartbeat::ocf:mysql): Stopped 
pengine[5003]: 2008/03/19_16:55:58 notice: native_print: apache_id (heartbeat::ocf:apache):Stopped 
pengine[5003]: 2008/03/19_16:55:58 notice: native_print: email_id (heartbeat::ocf:MailTo):Stopped 
pengine[5003]: 2008/03/19_16:55:58 WARN: native_color: Resource drbd_id:1 cannot run anywhere


2 node cluster, one node in standby, failed start on the other node, 
that means the resource cannot run anywhere.



cib.xml resources and constraints sections:

<resources>
  <master_slave id="ms-drbd_id">
    <meta_attributes id="ma-ms-drbd1_id">
      <attributes>
        <nvpair id="ma-ms-drbd-1_id" name="clone_max" value="2"/>
        <nvpair id="ma-ms-drbd-2_id" name="clone_node_max" value="1"/>
        <nvpair id="ma-ms-drbd-3_id" name="master_max" value="1"/>
        <nvpair id="ma-ms-drbd-4_id" name="master_node_max" value="1"/>
        <nvpair id="ma-ms-drbd-5_id" name="notify" value="yes"/>
        <nvpair id="ma-ms-drbd-6_id" name="globally_unique" value="false"/>
        <nvpair id="ma-ms-drbd-7_id" name="target_role" value="started"/>
      </attributes>
    </meta_attributes>
    <primitive id="drbd_id" class="ocf" provider="heartbeat" type="drbd">
      <operations>
        <op id="drbd-monitoring" interval="30s" name="monitor" timeout="15s"/>


You might want to monitor both the slave and the master here.

  <operations>
    <op id="op1" name="monitor" interval="5s" timeout="5s" role="Master"/>
    <op id="op2" name="monitor" interval="6s" timeout="5s" role="Slave"/>
  </operations>

Make sure you use different intervals, because multiple monitor 
operations with the same interval on one resource are not supported.


      </operations>
      <instance_attributes id="ia-drbd_id">
        <attributes>
          <nvpair id="drdb-resource_id" name="drbd_resource" value="r0"/>
        </attributes>
      </instance_attributes>
    </primitive>
  </master_slave>
  <primitive id="fs_id" class="ocf" provider="heartbeat" type="Filesystem">
    <operations>
      <op id="Filesystem_Monitoring" interval="10s" name="monitor" timeout="30s"/>
    </operations>
    <instance_attributes id="ia-fs_id">
      <attributes>
        <nvpair id="ia-fs-1_id" name="fstype" value="ext3"/>
        <nvpair id="ia-fs-2_id" name="directory" value="/export_www"/>
        <nvpair id="ia-fs-3_id" name="device" value="/dev/drbd1"/>
      </attributes>
    </instance_attributes>
  </primitive>
  <primitive id="ip_id" class="ocf" provider="heartbeat" type="IPaddr">
    <operations>
      <op id="ip-monitoring" interval="10s" name="monitor" timeout="30s"/>
    </operations>
    <instance_attributes id="ia-ip_id">
      <attributes>
        <nvpair id="ip_id" name="ip" value="132.206.178.80"/>
      </attributes>
    </instance_attributes>
  </primitive>
  <primitive id="mysql_id" class="ocf" provider="heartbeat" type="mysql">
    <operations>
      <op id="mysql-monitoring" interval="10s" name="monitor" timeout="30s"/>
    </operations>
  </primitive>
  <primitive id="apache_id" class="ocf" provider="heartbeat" type="apache">
    <operations>
      <op id="apache-monitoring" interval="10s" name="monitor" timeout="30s"/>
    </operations>
  </primitive>
  <primitive id="email_id" class="ocf" 

Re: [Linux-HA] Demote primary when connectivity lost.

2008-03-20 Thread Dominik Klein

Guy wrote:

Hi guys,

After much fiddling and learning (still loads to do though) I've got
my 2 node primary/secondary / secondary/primary setup more or less
working. On failure of node 1, node 2 takes both drbd partitions as
primary and mounts the partitions and nfs etc.
When node 1 is brought back up, I wait for node 2 to sync back over the
old primary on node 1 and then use crm_resource to move the one
primary back to node 1. Is there any way to do this automatically?
Wait for drbd to sync and then push the primary back to node 1? This is
just curiosity; doing it manually gives me a chance to see that all is
well.

The problem I've really got is if one node just loses connectivity.
I've played around with dopd and pingd but this hasn't given the
desired results. I have one interface into the network and one
connecting the machines by crossover.
If node 1 loses connectivity (with dopd running) it fences /dev/drbd0
on node 2, thus stopping node 2 from taking it as primary. What I
really need it to do is force any primary partitions into secondary
mode if the node loses connectivity so that the live node can sync
back to it after recovery of connectivity. I don't see that dopd can
help me with this, so do I make some sort of constraint with pingd to
demote the partitions if there's no connectivity?


Just set a score of -infinity for the master role when pingd attribute 
is 0 or undefined.


Something like:

<rsc_location id="my_resource:connected" rsc="my_resource">
  <rule role="master" id="my_resource:connected:rule" score="-INFINITY" boolean_op="or">
    <expression id="my_resource:connected:expr:undefined" attribute="pingd" operation="not_defined"/>
    <expression id="my_resource:connected:expr:zero" attribute="pingd" operation="lte" value="0"/>
  </rule>
</rsc_location>


I have location constraints putting one primary partition on each
node, so would I need to do something with the scoring to ensure that
demoted partitions stayed that way until resync by drbd was done?


That does not seem possible right now.

I'd go with keeping the primary on the second node until you manually 
verified drbd has synced and then migrate manually.
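
In case it helps, the manual part usually boils down to something like this once /proc/drbd shows UpToDate/UpToDate again (resource and node names are placeholders; -M works by adding a location constraint, which you will want to remove again later, see the crm_resource man page):

crm_resource -M -r ms-drbd0 -H node1    # move the master/primary role back to node1
# later, once you are happy with the result:
crm_resource -U -r ms-drbd0             # drop the constraint created by -M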



I've attached my conf files. As you can see the only constraints I
currently have are the location preferences for the primary partitions
and the colocation and order constraints to ensure the groups for the
fs, nfs and ipaddr only start on the appropriate node.


Regards
Dominik


Re: [Linux-HA] (Bug?) regarding resource_stickiness, master_slave and master-colocated groups

2008-03-14 Thread Dominik Klein

Adrian Chapela wrote:

Dominik Klein escribió:

Hi

during the writeup of ScoreCalculation on the wiki, I noticed 
something strange. It'd be nice to know whether this is on purpose or 
a bug.


Test setup is:

2 nodes, 1 drbd device, a group of 3 resource which are to run on top 
of the drbd master.


resource_stickiness is set to 100

If I use a colocation constraint with a score of infinity, the master 
receives a stickiness bonus of 600.


If I change the colocation score to a numeric value (I tested 1000 and 
5000), the bonus is reduced to 400.


I could explain the 600 as 2 * num_resources * stickiness, but I 
cannot see where those 400 come from.


Is this a bug or (why?) is this intended?
I think there is a BUG related with master_slave resources. I have 
opened this bug: 
http://developerbugs.linux-foundation.org/show_bug.cgi?id=1852 and today 
I have no response... :(


Is this which are you talking about ?


No, but I experienced that as well. I don't know why it happens, but I 
think you can get around it.


Please try this:

   <!-- make clone instance :0 run on node1 -->
   <rsc_location id="rscloc-ms-drbd1:0" rsc="drbd1:0">
     <rule id="rscloc-ms-drbd1:0-rule1" score="500">
       <expression id="rscloc-ms-drbd1:0-rule1-expr" attribute="#uname" operation="eq" value="node1"/>
     </rule>
     <rule id="rscloc-ms-drbd1:0-rule2" score="-500">
       <expression id="rscloc-ms-drbd1:0-rule2-expr" attribute="#uname" operation="eq" value="node2"/>
     </rule>
   </rsc_location>
   <!-- make clone instance :1 run on node2 -->
   <rsc_location id="rscloc-ms-drbd1:1" rsc="drbd1:1">
     <rule id="rscloc-ms-drbd1:1-rule1" score="500">
       <expression id="rscloc-ms-drbd1:1-rule1-expr" attribute="#uname" operation="eq" value="node2"/>
     </rule>
     <rule id="rscloc-ms-drbd1:1-rule2" score="-500">
       <expression id="rscloc-ms-drbd1:1-rule2-expr" attribute="#uname" operation="eq" value="node1"/>
     </rule>
   </rsc_location>

Solved that problem for me.

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] (Bug?) regarding resource_stickiness, master_slave and master-colocated groups

2008-03-14 Thread Dominik Klein

Solved that problem for me.


At least with a colocated resource I have to add.

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] (Bug?) regarding resource_stickiness, master_slave and master-colocated groups

2008-03-14 Thread Dominik Klein

Dominik Klein wrote:

Solved that problem for me.


At least with a colocated resource I have to add.


Urghs. Friday afternoon ...

I just wanted to verify that and it turns out my method does not restart 
the whole thing on a slave failure. That's true.


But it does still restart the whole thing if I shutdown the slave node 
and let it rejoin the cluster. In fact, what I see is that for a short 
time, BOTH nodes become standby. Right after that, both nodes are 
shown as online and then the master_slave resource including all clone 
instances and colocated resources are restarted.


Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] DRBD+Hearbeat not working as intended

2008-03-13 Thread Dominik Klein

Hi


I have a drbd+heartbeat setup and I am having a problem.

 


If the machine which drbd is master shuts down the passive machine does
not change its status to active one and because of that it cant mount
the drbd file system.

 


Can anyone give me some feedback in this ??

 

 


here is my cib.xml

 


cib admin_epoch=0 have_quorum=true ignore_dtd=false num_peers=2
cib_feature_revision=2.0 epoch=83 generated=true
ccm_transition=4 dc_uuid=56ec2257-b0e1-4395-8ca2-ff2f96151b55
num_updates=1 cib-last-written=Fri Feb 29 08:00:29 2008

   configuration

 crm_config

   cluster_property_set id=cib-bootstrap-options

 attributes

   nvpair id=cib-bootstrap-options-dc-version
name=dc-version value=2.1.3-node:
552305612591183b1628baa5bc6e903e0f1e26a3/

   nvpair id=cib-bootstrap-options-last-lrm-refresh
name=last-lrm-refresh value=1204282824/

 /attributes

   /cluster_property_set

 /crm_config

 nodes

   node id=34a67e55-71b1-421f-96cc-519ef05b110b
uname=pgslave.blumar.com.br type=normal

 instance_attributes
id=nodes-34a67e55-71b1-421f-96cc-519ef05b110b

   attributes

 nvpair id=standby-34a67e55-71b1-421f-96cc-519ef05b110b
name=standby value=off/

   /attributes

 /instance_attributes

   /node

   node id=56ec2257-b0e1-4395-8ca2-ff2f96151b55
uname=pgmaster.blumar.com.br type=normal/

 /nodes

 resources

   master_slave id=ms-drbd0

 meta_attributes id=ma-ms-drbd0

   attributes

 nvpair id=ma-ms-drbd0-1 name=clone_max value=2/

 nvpair id=ma-ms-drbd0-2 name=clone_node_max
value=1/

 nvpair id=ma-ms-drbd0-3 name=master_max value=1/

 nvpair id=ma-ms-drbd0-4 name=master_node_max
value=1/

 nvpair id=ma-ms-drbd0-5 name=notify value=yes/

 nvpair id=ma-ms-drbd0-6 name=globally_unique
value=false/

 nvpair id=ma-ms-drbd0-7 name=target_role
value=started/

   /attributes

 /meta_attributes

 primitive id=drbd0 class=ocf provider=heartbeat
type=drbd

   instance_attributes id=ia-drbd0

 attributes

   nvpair id=ia-drbd0-1 name=drbd_resource
value=repdata/

 /attributes

   /instance_attributes

   meta_attributes id=drbd0:0_meta_attrs

 attributes/

   /meta_attributes

 /primitive

   /master_slave

   group id=group_pgsql

 meta_attributes id=group_pgsql_meta_attrs

   attributes

 nvpair id=group_pgsql_metaattr_target_role
name=target_role value=stopped/


Why stopped here?


   /attributes

 /meta_attributes

 primitive id=resource_ip class=ocf type=IPaddr
provider=heartbeat

   instance_attributes id=resource_ip_instance_attrs

 attributes

   nvpair id=f294ba51-00f9-4a9b-80c0-43aa7944f474
name=ip value=10.3.3.24/

 /attributes

   /instance_attributes

   meta_attributes id=resource_ip_meta_attrs

 attributes

   nvpair id=resource_ip_metaattr_target_role
name=target_role value=started/


But started here?


 /attributes

   /meta_attributes

 /primitive

 primitive id=resource_FS class=ocf type=Filesystem
provider=heartbeat

   instance_attributes id=resource_FS_instance_attrs

 attributes

   nvpair id=d1413b97-4944-4fee-9cba-b4ab3e71f83f
name=device value=/dev/drbd0/

   nvpair id=baad1fbb-3389-4778-8832-91d5720341a6
name=directory value=/repdata/

   nvpair id=35e6e951-ace4-4bbf-9d3e-5824219a9809
name=fstype value=ext3/

 /attributes

   /instance_attributes

   meta_attributes id=resource_FS_meta_attrs

 attributes

   nvpair id=resource_FS_metaattr_target_role
name=target_role value=started/


And started here?


 /attributes

   /meta_attributes

 /primitive

 primitive id=resource_pgsql class=ocf type=pgsql
provider=heartbeat

   meta_attributes id=resource_pgsql_meta_attrs

 attributes

   nvpair id=resource_pgsql_metaattr_target_role
name=target_role value=started/


And started here as well. That does not make sense. It is sufficient to 
set the target_role for the group. You do not have to set it for each 
resource.
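
In other words, one target_role on the group is enough; as a sketch, you can set or change it in a single place with

  crm_resource -r group_pgsql --meta -p target_role -v started

(or "stopped") and drop the per-primitive target_role entries.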



 /attributes

   /meta_attributes

 /primitive

   /group

 /resources

 constraints

   rsc_colocation id=colocation_ from=group_pgsql
to=group_pgsql score=INFINITY/


This does not make any sense.


   rsc_location id=location_ rsc=group_pgsql

 rule id=prefered_location_ score=INFINITY

   expression attribute=#uname
id=a7b8a885-27c2-4157-a597-42dd6cafba8c operation=eq
value=pgmaster.blumar.com.br/

 /rule

   

Re: [Linux-HA] Enhanced version of showscores and a major updateonthe score calculation documentation

2008-03-13 Thread Dominik Klein

Hi Dominik,

this looks good now. Thank you for fixing.


You're welcome.


One question: Are you able to cache the default stickiness values?
If you determine that every loop it costs time. 


Good idea. Thanks.

The script runs here for 5.7 seconds for 18 resources on the DC. That's long. 
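
A rough sketch of that caching idea, reusing the cibadmin calls the script already has (untested):

  # query the cluster-wide defaults once, before the while loop
  default_stickiness=`cibadmin -Q -o crm_config 2>/dev/null|grep "default[_-]resource[_-]stickiness"|grep -o -E 'value ?= ?"[^ ]*"'|cut -d '"' -f 2`
  default_failure_stickiness=`cibadmin -Q -o crm_config 2>/dev/null|grep "resource[_-]failure[_-]stickiness"|grep -o -E 'value ?= ?"[^ ]*"'|cut -d '"' -f 2`
  # inside the loop, fall back to "$default_stickiness" etc. instead of calling cibadmin once per resource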


Regards
Dominik


showscores.sh
Description: Bourne shell script
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

[Linux-HA] Enhanced version of showscores and a major update on the score calculation documentation

2008-03-12 Thread Dominik Klein

Hi

just yesterday I found a way better way to get the scores from ptest. 
This new version can not only display normal scores, it can also display 
master scores, which seems quite important these days.


It still produces a heck of a lot of logs when executed, but thats just 
the nature of the commands I use - thats not to change right now :(


Now you can also sort the output by node or resource. Resource is the 
default, give node as a parameter to sort by node.


Please test and report problems.

Also, I put a major update on the ScoreCalculation page this morning to 
make things about scores clearer. Especially concerning groups and 
master_slave resources.


Post questions if there's something that's not clear yet.

http://www.linux-ha.org/ScoreCalculation

Regards
Dominik


showscores.sh
Description: Bourne shell script
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Which STONITH devices is everybody using?

2008-03-12 Thread Dominik Klein

Thanks for the replies so far.

No one else?

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Enhanced version of showscores and a major update on the score calculation documentation

2008-03-12 Thread Dominik Klein

Andreas Mock wrote:

Hi Dominik,

you know I like your script, but the newest version broke
something:
When a resource name has '.' (dots) in it, 


That might just be. Use dashes ;)


the way you split
the $line in the while loop to get the score, node and resource name
doesn't work any more.

Can you send me a line of relevant ptest-output with the pattern
After in it. I don't have such in my configuration.


I sent you a PM with the output from my test system.

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Enhanced version of showscores and a major update on the score calculation documentation

2008-03-12 Thread Dominik Klein

Use dashes ;)


Well - turns out to be hyphen. At least as of dict.leo.org :)

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Strange behavior of the resource group on 2 nodes cluster

2008-03-11 Thread Dominik Klein

In a colocated group (which is the default), all subsequent resources
are tied to the group's first resource with a score of INFINITY.

To not allow them to run on another node but the node the first resource
is run on, they also get -INFINITY for any other node.



Thank you very much, Dominik, for your reply - but how can I then achieve the 
Intended  behavior: group failover  on the third failure ?


Although I cannot explain it score-wise, as you can only see INFINITY 
for the group resources, this should work. Just let a resource in the 
group fail a couple of times and see what happens. Works for me. I'll 
have Andrew explain this when he's back from Australia :)


Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Strange behavior of the resource group on 2 nodes cluster

2008-03-11 Thread Dominik Klein

Hi Dominik,

 I tried to let the resource in the group fail a couple of times,
but after the 2-nd try will the failcount for both resources NOT increased. 


Did you wait for the cluster to restart the resource after you produced 
the failure before causing another failure?


It 
shows after each try (with ifconfig eth0:x down ) still the same:


Resource Score Node  Stickin. Failc.  Fail.-Stick.
IPaddr_193_27_40_57  0 dbora  2 0   -3
IPaddr_193_27_40_57  1 demo   2 1   -3
ubis_udbmain_13  -INFINITY dbora  2 0   -3
ubis_udbmain_13  INFINITY  demo   2 1   -3


Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: AW: AW: AW: [Linux-HA] Switchover problem with DRBD

2008-03-10 Thread Dominik Klein

Schmidt, Florian wrote:

Hello again,

you're right, i do have DRBD 8.2.1 installed.

Well, you mean downgrading on 0.7.x would be better?
This is only a test-cluster so this shouldn't be a problem. But I'll try 
re-installing my current DRBD-version first and then (if this doesn't help) 
downgrading to 0.7.x-DRBD.


I guess upgrading to 8.2.5 would be the way to go. Although I'm not sure 
the issue you had was fixed in that version, a downgrade to 0.7 should 
never be needed.


I sometimes saw what you reported (no response from drbd module) during 
my own tests, but I was not able to reproduce it. Sorry.


Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Strange behavior of the resource group on 2 nodes cluster

2008-03-10 Thread Dominik Klein

Nikita Michalko wrote:

Hi all!

 I have some troubles with HA V2.1.3 on SLES10 SP1, two-node cluster with 1 
resource group=2 resources.  Intended is  forced failover of the group on the 
third failure of any resource in the group; one node is preferred over the 
other (see attached configuration).
After start are resources running on the preferred node (demo), as expected, 
but with 1 failcount and  with following score (script showscores):

Resource  Score  Node   Stick. Failcount  Fail.-Stickiness
IPaddr_193_27_40_57   0  dbora2   0   -3
IPaddr_193_27_40_57   2  demo2   0   -3
ubis_udbmain_13  -INFINITYdbora   2   0   -3 
ubis_udbmain_13   INFINITYdemo   2   1-3


Score of the first resource (IPaddr_193_27_40_57)  is 2 as expected (group 
resource_stickiness=1) , but the second resource has score INFINITY- 
why ? Because of  added colocation constraint for group ?


In a colocated group (which is the default), all subsequent resources 
are tied to the group's first resource with a score of INFINITY.


To not allow them to run on another node but the node the first resource 
is run on, they also get -INFINITY for any other node.


Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Which STONITH devices is everybody using?

2008-03-10 Thread Dominik Klein

Hi

I read the list of stonith plugins and had a look at which devices they 
support. The list of devices you can buy new today was rather small.


So I'd like to know which STONITH devices heartbeat users use.

Which device do you use?
What kind of device is it?
How much is it?
How well does it work?
What problems did you encounter?
Did you have to write the stonith plugin yourself?
If so, did you contribute it to the project?

As I'm not much of a programmer and have to buy some stonith hardware 
within the next weeks, help on this would be much appreciated.


Regards
Dominik

I'll even start with a reply myself:
APC AP7920 power distribution unit.
Cost about 450 Euros a piece.
It seems to work well, do not have it in production yet.
I set it up with apcmastersnmp.
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Force switch with DRBD

2008-03-04 Thread Dominik Klein

DucaConte Balabam wrote:

Hello,
I've a cluster using heartbeat v2 and drbd in master/slave configuration.

It's:

Last updated: Tue Mar  4 09:49:30 2008
Current DC: rman1c (875afc12-b88e-4940-9816-218d2a5911c3)
2 Nodes configured.
2 Resources configured.


Node: rman1a (4d7bd4ec-c121-4b13-a2d4-aec820ea36d5): online
Node: rman1c (875afc12-b88e-4940-9816-218d2a5911c3): online

Master/Slave Set: ms-drbd0
drbd0:0 (heartbeat::ocf:drbd):  Master rman1a
drbd0:1 (heartbeat::ocf:drbd):  Started rman1c
Resource Group: Oracle
FS  (heartbeat::ocf:Filesystem):Started rman1a
V_IP(heartbeat::ocf:IPaddr2):   Started rman1a
Ora_DB  (heartbeat::ocf:oracle):Started rman1a
Ora_LSNR(heartbeat::ocf:oralsnr):   Started rman1a

How can I force all resources to move to the other node? There's  acommand?


Try

crm_resource -M -r Oracle

This creates a rsc_location constraint, that does not allow Oracle to 
run in its momentary location. Therefore, the cluster will migrate it.


Don't forget

crm_resource -U -r Oracle

afterwards to remove that rsc_location constraint (and allow the 
resource to run on the first node again). If you do not do this, the 
resource will never be able to run on the first node again.


Depending on your configuration, the resource might be migrated back to 
the first node after the rsc_location constraint has been removed. If 
so, you should read http://wiki.linux-ha.org/ScoreCalculation and set 
resource_stickiness to a reasonable value.


Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] How to allow resources to ping-pong forever?

2008-03-03 Thread Dominik Klein

Alex Spengler wrote:

Hi,

I'm stuck in setting up my cluster.
What I want to achive is
- run apache on whatever node together with cluster IP which is
172.23.100.200.
- if apache fails - switch over to other node
- if gateway 172.23.100.1 is not reachable - switch over to other node

AND allow unlimited number of switchovers!
This is the problem, it does switch over but only once or twice and then
it's stuck .. any ideas?


*unlimited* is pretty much impossible I think.

But you may start with this:

Set a #uname score of 5050 for one node, 5000 for the other node.
Set resource_failure_stickiness to -100.
Use a multiplier of 100 for pingd and pingd as a score_attribute (sth like:
   <rsc_location id="rsc-loc-syslog" rsc="syslog">
     <rule id="syslog-connected" score_attribute="pingd">
       <expression id="syslog-connected-rule-1"
         attribute="pingd" operation="defined"/>
     </rule>
   </rsc_location>
)

Then you will end up with:

Startup
node1: 5050 + 100 pingd
node2: 5000 + 100 pingd
decision: start on node1
now, whatever fails will reduce the score by 100, causing a failover.

With a start at 5000, you can have 50 failovers. Enlarge as needed.
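
The two node preferences can be expressed as location constraints, loaded for example with cibadmin (a sketch; the ids are invented, rsc and node names are taken from your config below):

  cibadmin -o constraints -C -X '<rsc_location id="loc-fe1" rsc="apache_group_p80"><rule id="loc-fe1-rule" score="5050"><expression id="loc-fe1-expr" attribute="#uname" operation="eq" value="big-sSTATSfe1"/></rule></rsc_location>'
  cibadmin -o constraints -C -X '<rsc_location id="loc-fe2" rsc="apache_group_p80"><rule id="loc-fe2-rule" score="5000"><expression id="loc-fe2-expr" attribute="#uname" operation="eq" value="big-sSTATSfe2"/></rule></rsc_location>'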

Regards
Dominik


thanks in advance
Alex

*Here my config:*
*Node1 - big-sSTATSfe1*
eth0: 172.23.100.26
eth1: 192.168.0.1

*Node2 - big-sSTATSfe2*
eth0: 172.23.100.22
eth1:192.168.0.2

*ha.cf:*
use_logd yes
node big-sSTATSfe1 big-sSTATSfe2
deadtime 5
deadping 5
initdead 60
warntime 3
crm true
ucast eth0 172.23.100.22 # 172.23.100.26 on the second node
ucast eth1 192.168.0.2# 192.168.0.1 on the second node
ping 172.23.100.1

*cib.xml*
cib admin_epoch=1 epoch=1 num_updates=1 generated=true
have_quorum=true ignore_dtd=false num_peers=2
dc_uuid=68cd29ed-c7fe-44d9-9fe8-a1258e5b1d0f
 configuration
  crm_config
   cluster_property_set id=cib-bootstrap-options
attributes
 nvpair id=cib-bootstrap-options-symmetric-cluster
name=symmetric-cluster value=true/
 nvpair id=cib-bootstrap-options-no-quorum-policy
name=no-quorum-policy value=ignore/
 nvpair id=cib-bootstrap-options-default-resource-stickiness
name=default-resource-stickiness value=0/
 nvpair id=cib-bootstrap-options-default-resource-failure-stickiness
name=default-resource-failure-stickiness value=-100/
 nvpair id=cib-bootstrap-options-stonith-enabled
name=stonith-enabled value=true/
 nvpair id=cib-bootstrap-options-stonith-action name=stonith-action
value=reboot/
 nvpair id=cib-bootstrap-options-remove-after-stop
name=remove-after-stop value=false/
 nvpair id=cib-bootstrap-options-short-resource-names
name=short-resource-names value=true/
 nvpair id=cib-bootstrap-options-transition-idle-timeout
name=transition-idle-timeout value=1min/
 nvpair id=cib-bootstrap-options-default-action-timeout
name=default-action-timeout value=10s/
 nvpair id=cib-bootstrap-options-is-managed-default
name=is-managed-default value=true/
/attributes
   /cluster_property_set
  /crm_config
  nodes/
  resources
   group id=apache_group_p80 ordered=true collocated=true
primitive class=ocf provider=heartbeat type=IPaddr
id=IPaddr_p80
 instance_attributes id=IPaddr_1_inst_attr
  attributes
   nvpair id=IPaddr_p80_attr_0 name=ip value=172.23.100.200/
   nvpair id=IPaddr_p80_attr_1 name=netmask value=255.255.255.0/
   nvpair id=IPaddr_p80_attr_2 name=nic value=eth0/
   nvpair id=IPaddr_p80_attr_3 name=broadcast value=172.23.100.255
/
  /attributes
 /instance_attributes
 operations
  op id=IPaddr_p80_mon name=monitor interval=2s timeout=3s/
 /operations
/primitive
primitive id=apache_p80 class=lsb type=apache
provider=heartbeat
 instance_attributes id=inatt_apache_p80
  attributes
   nvpair name=configfile value=/etc/httpd/conf/httpd.conf
id=nvpb1_apache_p80/
   nvpair name=statusurl value=
http://172.23.100.200:80/server-status http://172.23.100.200/server-status
id=nvpb2_apache_p80/
  /attributes
 /instance_attributes
 operations
  op id=apache_p80:start name=start timeout=10s/
  op id=apache_p80:stop name=stop timeout=10s/
  op id=apache_p80:monitor name=monitor interval=2s
timeout=5s/
 /operations
/primitive
   /group
   clone id=pingd
instance_attributes id=pingd
 attributes
  nvpair id=pingd-clone_max name=clone_max value=2/
  nvpair id=pingd-clone_node_max name=clone_node_max value=1/
 /attributes
/instance_attributes
primitive id=gateway class=ocf type=pingd provider=heartbeat
 operations
  op id=gateway:child-monitor name=monitor interval=5s
timeout=5s prereq=nothing/
  op id=gateway:child-start name=start prereq=nothing/
 /operations
 instance_attributes id=pingd_inst_attrs
  attributes
   nvpair id=pingd-dampen name=dampen value=5s/
   nvpair id=pingd-multiplier name=multiplier value=100/
  /attributes
 /instance_attributes
/primitive
   /clone
  

Re: [Linux-HA] (no subject)

2008-02-29 Thread Dominik Klein

Dominik Klein wrote:

Schmidt, Florian wrote:

Hello list,
I still have problem with my heartbeat-config

I want heartbeat to start AFD. I checked the RA for LSB-compatibility
and think that it's right now.

The log file says, that the bash does not find the command afd.


crmd[2725]: 2008/02/29_11:32:00 info: do_lrm_rsc_op: Performing
op=AFD_start_0 key=5:1:71d7b119-77c3-4e6b-9ac6-6acf20d8cf61)
lrmd[2722]: 2008/02/29_11:32:01 info: rsc:AFD: start
lrmd[2785]: 2008/02/29_11:32:01 WARN: For LSB init script, no additional
parameters are needed.
lrmd[2722]: 2008/02/29_11:32:01 info: RA output: (AFD:start:stdout)
Starting AFD for afdha :
lrmd[2722]: 2008/02/29_11:32:01 info: RA output: (AFD:start:stderr)
bash: afd: command not found

tengine[2773]: 2008/02/29_11:32:01 info: extract_event: Aborting on
transient_attributes changes for 44425bd9-2cba-4d6a-ac62-82a8bb81a23d
lrmd[2722]: 2008/02/29_11:32:01 info: RA output: (AFD:start:stdout)
Failed

tengine[2773]: 2008/02/29_11:32:01 info: update_abort_priority: Abort
priority upgraded to 100
crmd[2725]: 2008/02/29_11:32:01 ERROR: process_lrm_event: LRM operation
AFD_start_0 (call=3, rc=1) Error unknown error
tengine[2773]: 2008/02/29_11:32:01 info: update_abort_priority: Abort
action 0 superceeded by 2
tengine[2773]: 2008/02/29_11:32:01 WARN: status_from_rc: Action start on
noderz failed (target: null vs. rc: 1): Error
tengine[2773]: 2008/02/29_11:32:01 WARN: update_failcount: Updating
failcount for AFD on 91d062c3-ad0a-4c24-b759-acada7f19101 after failed
start: rc=1

There is an extra user for AFD, called afdha. The afdha-script is
attached.

Switching to to afdha via su afdha I can start AFD, so I don't
understand, where's the problem. (


su $afduser -c afd -a

Try
su - $afduser -c afd -a
---^

This will make su login as the user (i.e. execute its profile).

Right now, you're running this as root, in the root-environment. If 
afd is not in root's PATH, then it cannot work.


To add some more possibilities:

You could also use
export PATH=$PATH:whatever
at the start of the script, so you have afd in your PATH.

Or use /path/to/afd instead of afd.
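
Putting those options together, the start action could look roughly like this ($afduser is from your script, /path/to/afd is made up):

  # either make afd resolvable from within the script itself ...
  export PATH=$PATH:/path/to/afd
  afd -a
  # ... or run it through a login shell of the afd user, so that user's profile sets up PATH
  su - $afduser -c "afd -a"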

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] (no subject)

2008-02-29 Thread Dominik Klein

Schmidt, Florian wrote:
Hello list, 


I still have problem with my heartbeat-config

I want heartbeat to start AFD. I checked the RA for LSB-compatibility
and think that it's right now.

The log file says, that the bash does not find the command afd.


crmd[2725]: 2008/02/29_11:32:00 info: do_lrm_rsc_op: Performing
op=AFD_start_0 key=5:1:71d7b119-77c3-4e6b-9ac6-6acf20d8cf61)
lrmd[2722]: 2008/02/29_11:32:01 info: rsc:AFD: start
lrmd[2785]: 2008/02/29_11:32:01 WARN: For LSB init script, no additional
parameters are needed.
lrmd[2722]: 2008/02/29_11:32:01 info: RA output: (AFD:start:stdout)
Starting AFD for afdha :
lrmd[2722]: 2008/02/29_11:32:01 info: RA output: (AFD:start:stderr)
bash: afd: command not found

tengine[2773]: 2008/02/29_11:32:01 info: extract_event: Aborting on
transient_attributes changes for 44425bd9-2cba-4d6a-ac62-82a8bb81a23d
lrmd[2722]: 2008/02/29_11:32:01 info: RA output: (AFD:start:stdout)
Failed

tengine[2773]: 2008/02/29_11:32:01 info: update_abort_priority: Abort
priority upgraded to 100
crmd[2725]: 2008/02/29_11:32:01 ERROR: process_lrm_event: LRM operation
AFD_start_0 (call=3, rc=1) Error unknown error
tengine[2773]: 2008/02/29_11:32:01 info: update_abort_priority: Abort
action 0 superceeded by 2
tengine[2773]: 2008/02/29_11:32:01 WARN: status_from_rc: Action start on
noderz failed (target: null vs. rc: 1): Error
tengine[2773]: 2008/02/29_11:32:01 WARN: update_failcount: Updating
failcount for AFD on 91d062c3-ad0a-4c24-b759-acada7f19101 after failed
start: rc=1

There is an extra user for AFD, called afdha. The afdha-script is
attached.

Switching to to afdha via su afdha I can start AFD, so I don't
understand, where's the problem. (


su $afduser -c afd -a

Try
su - $afduser -c afd -a
---^

This will make su login as the user (i.e. execute its profile).

Right now, you're running this as root, in the root-environment. If 
afd is not in root's PATH, then it cannot work.


Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: AW: [Linux-HA] (no subject)

2008-02-29 Thread Dominik Klein

Hello,

I wrote the following lines on top of the script code:
export PATH=$PATH:/home/afdha
export AFD_WORK_DIR=/usr/afd

AFD needs one environment-variable named AFD_WORK_DIR to know, where to
work

I also did chown afdha:501 /usr/afd and chown /home/afdha

How far does this work, because the heartbeat still does not start AFD,
but logs errors:


crmd[3541]: 2008/02/29_13:58:26 info: do_lrm_rsc_op: Performing
op=AFD_start_0 key=4:1:151fabd8-d44a-40b1-b946-38480dbd8c8f)
lrmd[3538]: 2008/02/29_13:58:26 info: rsc:AFD: start
lrmd[3583]: 2008/02/29_13:58:26 WARN: For LSB init script, no additional
parameters are needed.
lrmd[3538]: 2008/02/29_13:58:26 info: RA output: (AFD:start:stdout)
Starting AFD for afdha :
lrmd[3538]: 2008/02/29_13:58:26 info: RA output: (AFD:start:stderr)
ERROR   : Failed to determine AFD working directory!

lrmd[3538]: 2008/02/29_13:58:26 info: RA output: (AFD:start:stderr)
No option -w or environment variable AFD_WORK_DIR set.


This reads as if you could start "afd -w /usr/afd ..." to set the work dir.


lrmd[3538]: 2008/02/29_13:58:26 info: RA output: (AFD:start:stdout)
Failed.

crmd[3541]: 2008/02/29_13:58:26 info: process_lrm_event: LRM operation
AFD_start_0 (call=3, rc=0) complete
tengine[3556]: 2008/02/29_13:58:26 info: match_graph_event: Action
AFD_start_0 (4) confirmed on noderz (rc=0)
tengine[3556]: 2008/02/29_13:58:26 info: send_rsc_command: Initiating
action 5: AFD_monitor_3 on noderz
crmd[3541]: 2008/02/29_13:58:26 info: do_lrm_rsc_op: Performing
op=AFD_monitor_3 key=5:1:151fabd8-d44a-40b1-b946-38480dbd8c8f)
lrmd[3538]: 2008/02/29_13:58:27 info: RA output: (AFD:monitor:stderr)
ERROR   : Failed to determine AFD working directory!
  No option -w or environment variable AFD_WORK_DIR set.



How can I set this environment-variable AFD_WORK_DIR  globally
I have it in ~/.bash_profile, but this doesn't work for the afdha-user
:-(

I'm working with linux only since 4 or 5 weeks (since I started building
that heartbeat/drbd-cluster)...so my skills are not that good, but I try
to improve ;)

Thanks
Florian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Script to calculate scores to allow a defined number of resource failures before failover

2008-02-29 Thread Dominik Klein

Some cosmetic changes. Thanks to wschlich.

Regards
Dominik


calc_linux_ha_scores.sh
Description: Bourne shell script
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Question of Service Monitoring in HAv2

2008-02-29 Thread Dominik Klein
  I had a query about the service monitoring in HA v2, I was 
wondering if i can configure it in such a way that if a service fails , 
heartbeat should try to restart it say n number of times before it fences the 
system


It will (by default) only fence the system, if stop fails.

If monitor fails, it will only try to restart the resource on the best node.
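
There is no direct restart counter. The usual way to get "restart n times locally, then move" is via scores, e.g. (numbers invented, resource_stickiness left out): with a node preference of 250 and a resource_failure_stickiness of -100, the resource survives two local failures and is moved away by the third. Per resource ("myresource" is a placeholder) that would be set like:

  crm_resource -r myresource --meta -p resource_failure_stickiness -v -100

See http://www.linux-ha.org/ScoreCalculation for the details.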

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Node Priority

2008-02-29 Thread Dominik Klein

When heartbeat is started(on both nodes) my node called pgslave gets
promoted to DC and that cannot happen, my node pgmaster should always be
the active part of the service and also this node needs to always try to
get promoted to DC when he have the chance(pgslave gotta spend minimal
time being DC, only in case of pgmaster failing).

 


Can you guys give me some advice on how to solve this ??


Start pgmaster alone, wait until it becomes DC, then start pgslave.

Other than that, you can not do anything about which node is the DC.

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: AW: AW: [Linux-HA] (no subject)

2008-02-29 Thread Dominik Klein

Schmidt, Florian wrote:


Mit freundlichen Grüßen


Hello,

I wrote the following lines on top of the script code:
export PATH=$PATH:/home/afdha
export AFD_WORK_DIR=/usr/afd

AFD needs one environment-variable named AFD_WORK_DIR to know, where

to

work

I also did chown afdha:501 /usr/afd and chown /home/afdha

How far does this work, because the heartbeat still does not start AFD,
but logs errors:


crmd[3541]: 2008/02/29_13:58:26 info: do_lrm_rsc_op: Performing
op=AFD_start_0 key=4:1:151fabd8-d44a-40b1-b946-38480dbd8c8f)
lrmd[3538]: 2008/02/29_13:58:26 info: rsc:AFD: start
lrmd[3583]: 2008/02/29_13:58:26 WARN: For LSB init script, no additional
parameters are needed.
lrmd[3538]: 2008/02/29_13:58:26 info: RA output: (AFD:start:stdout)
Starting AFD for afdha :
lrmd[3538]: 2008/02/29_13:58:26 info: RA output: (AFD:start:stderr)
ERROR   : Failed to determine AFD working directory!

lrmd[3538]: 2008/02/29_13:58:26 info: RA output: (AFD:start:stderr)
No option -w or environment variable AFD_WORK_DIR set.

This reads as if you could start "afd -w /usr/afd ..." to set the work dir.


I'm just trying this...

But is there no possibility to set this variable globally?

So well, thanks...THIS works, but now the next errors occurred -.-


I once again tried this with 
su - afdha

/etc/init.d/afdha start
Starting AFD for afdha : Password:
su: incorrect password
Failed.
touch: cannot touch `/var/lock/subsys/afd': Permission denied

I don't know, what password he needs...

The other problem is the denied permission for /var/lock/subsys/afd. How can I 
permit him to create this file there? I can create it by hand and chown this to 
him, but he will remove it at next stop :(


join IRC #linux-ha :)
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Awesome explanation of stickiness scores :)

2008-02-28 Thread Dominik Klein

Fajar Priyanto wrote:

Hi all,
This afternoon I would have been asked a question about resource|
failure_stickiness, because I was a bit confused the practical use of those 
stickiness in relation with score in location constrains. But, this page has 
explained it all very clearly: http://www.linux-ha.org/ScoreCalculation

Excellent :)

Hopefully this helps other as it goes into the list's archives.


I'll consider this a compliment :)

Thanks!

Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Deleting Master/Slave-Resources

2008-02-28 Thread Dominik Klein

Schmidt, Florian wrote:
Hi list, 


I'm not able to delete my DRBD-Master/Salve-Set.

I tried with 
crm_resource -D -r drbd_master_slave -t clone

and
crm_resource -D -r drbd_master_slave -t master-slave

drbd_master_slave is the name of my resource.

Anyone a short advice?


cibadmin -Q -o resources

copy your resource to the clipboard

cibadmin -D -X '<paste>' <enter>
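
For the resource named in your mail, that boils down to something like (matching tag and id is what matters):

  cibadmin -D -X '<master_slave id="drbd_master_slave"/>'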

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Newest version of 'showscores'

2008-02-28 Thread Dominik Klein

thanky you for the script and for pointing to the right direction.
May I change the format of the output? (Yes, I also saw the thing with
wrong headings)


Here's a newer version. It can now read resource-stickiness and 
resource_stickiness (notice the - and _). Both is possible, but up to 
now, only one was looked for.

Also fixed a problem with the headings being mixed up.

I know this thing produces a lot of logs, but at least it does display 
the scores, hu? :)


#!/bin/bash

# Feb 2008, Dominik Klein
# Display scores of Linux-HA resources

# Known issues:
# * cannot get resource[_failure]_stickiness values for master/slave and clone resources
#   if those values are configured as meta attributes of the master/slave or clone resource
#   instead of as meta attributes of the encapsulated primitive

if [ "`crmadmin -D | cut -d' ' -f4`" != "`uname -n|tr '[:upper:]' '[:lower:]'`" ]
  then echo "Warning: Script running not on DC. Might be slow(!)"
fi

# Heading
printf "%-16s%-16s%-16s%-16s%-16s%-16s\n" Resource Score Node Stickiness Failcount Failure-Stickiness

2>&1 ptest -LVVV | grep -E "assign_node|rsc_location" | grep -w -E "\ [-]{0,1}[0-9]*$" | while read line
do
	node=`echo $line|cut -d ' ' -f 8|cut -d ':' -f 1`
	res=`echo $line|cut -d ' ' -f 6|tr -d ","`
	score=`echo $line|cut -d ' ' -f 9|sed 's/1000000/INFINITY/g'`

	# get meta attribute resource_stickiness
	if crm_resource -g resource_stickiness -r $res --meta &>/dev/null
	then
		stickiness=`crm_resource -g resource_stickiness -r $res --meta 2>/dev/null`
	else if crm_resource -g resource-stickiness -r $res --meta &>/dev/null
		then
			stickiness=`crm_resource -g resource-stickiness -r $res --meta 2>/dev/null`
		# if that doesn't exist, get syntax like primitive resource-stickiness="100"
		else if ! stickiness=`crm_resource -x -r $res 2>/dev/null | grep -E "master|primitive|clone" | grep -o "resource[_-]stickiness=\"[0-9]*\"" | cut -d '"' -f 2 | grep -v "^$"`
			then
				# if no resource-specific stickiness is configured, grep the default value
				stickiness=`cibadmin -Q -o crm_config 2>/dev/null|grep "default[_-]resource[_-]stickiness"|grep -o -E 'value ?= ?"[^ ]*"'|cut -d '"' -f 2|grep -v "^$"`
			fi
		fi
	fi

	# get meta attribute resource_failure_stickiness
	if crm_resource -g resource_failure_stickiness -r $res --meta &>/dev/null
	then
		failurestickiness=`crm_resource -g resource_failure_stickiness -r $res --meta 2>/dev/null`
	else if crm_resource -g resource-failure-stickiness -r $res --meta &>/dev/null
		then
			failurestickiness=`crm_resource -g resource-failure-stickiness -r $res --meta 2>/dev/null`
		else
			# if that doesn't exist, get the default value
			failurestickiness=`cibadmin -Q -o crm_config 2>/dev/null|grep "resource[_-]failure[_-]stickiness"|grep -o -E 'value ?= ?"[^ ]*"'|cut -d '"' -f 2|grep -v "^$"`
		fi
	fi

	failcount=`crm_failcount -G -r $res -U $node 2>/dev/null|grep -o -E 'value ?= ?"[0-9]*"'|cut -d '"' -f 2|grep -v "^$"`

	printf "%-16s%-16s%-16s%-16s%-16s%-16s\n" $res $score $node $stickiness $failcount $failurestickiness
done|sort -k 1
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] linux-ha with drbd -- nothing working

2008-02-28 Thread Dominik Klein

Adam Kaufman wrote:

hi all,

I've been trying for the last few days to get heartbeat working with a basic 
drbd configuration.  I was initially using heartbeat 2.0.8, but eventually 
upgraded to 2.1.3.  the symptoms exhibited by each version of heartbeat were 
completely different, so here I'll focus on 2.1.3, as that's what I'd prefer to 
move forward with.

symptoms:

upon starting heartbeat, a monitor command is sent through the drbd resource, 
which returns OCF_NOT_RUNNING
the drbd resource(s) is not registered with crm_mon, and cannot be transitioned 
into a started state

configuration:

===cib.xml===
?xml version=1.0 ?
cib admin_epoch=17 epoch=1 have_quorum=true num_updates=0
  configuration
crm_config
  cluster_property_set id=test_cluster
attributes
  nvpair id=id-symmetric-cluster name=symmetric-cluster 
value=True/
  nvpair id=id-stickiness name=default-resource-stickiness 
value=INFINITY/
/attributes
  /cluster_property_set
/crm_config
nodes/
resources
  primitive class=ocf id=ip_resource provider=heartbeat type=IPaddr 
resource_stickiness=INFINITY


You should define resource_stickiness as a meta_attribute. It will work 
like this, but it is not the best way.
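
One way to move it into a meta attribute without hand-editing the XML (a sketch, using crm_resource):

  crm_resource -r ip_resource --meta -p resource_stickiness -v INFINITY

and then drop the resource_stickiness attribute from the primitive tag itself.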



instance_attributes
  attributes
nvpair name=ip value=10.107.10.20/
nvpair name=netmask value=22/
nvpair name=nic value=eth0/
  /attributes
/instance_attributes
  /primitive
  master_slave id=ms-drbd0
meta_attributes id=ma-ms-drbd0
  attributes
nvpair id=ma-ms-drbd0-1 name=clone_max value=2/
nvpair id=ma-ms-drbd0-2 name=clone_node_max value=1/
nvpair id=ma-ms-drbd0-3 name=master_max value=1/
nvpair id=ma-ms-drbd0-4 name=master_node_max value=1/
nvpair id=ma-ms-drbd0-5 name=notify value=yes/
nvpair id=ma-ms-drbd0-6 name=globally_unique value=false/
nvpair id=ma-ms-drbd0-7 name=target_role value=stopped/


You don't want the resource to be started. That's why the score is 
-INFINITY.
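
If you actually want it started, flip that meta attribute instead of editing the XML by hand, for example:

  crm_resource -r ms-drbd0 --meta -p target_role -v started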



  /attributes
/meta_attributes
primitive class=ocf id=drbd0 provider=heartbeat type=drbd 
resource_stickiness=INFINITY


see above


  instance_attributes id=ia-drbd0
attributes
  nvpair id=ia-drbd0-1 name=drbd_resource value=drbd0/
/attributes
  /instance_attributes
/primitive
  /master_slave
/resources
constraints
  rsc_location id=run_ip_resource rsc=ip_resource
rule id=pref_run_ip_resource score=100
  expression attribute=#uname operation=eq value=plat-pc-18/
/rule
  /rsc_location
/constraints
  /configuration
  status/
/cib
===drbd.conf===
resource drbd0 {
  protocol C;
  incon-degr-cmd echo 'incon-degr-cmd';
  startup {
degr-wfc-timeout 120;
  }
  disk {
on-io-error pass_on;
  }
  net {
on-disconnect reconnect;
  }
  syncer {
rate 100M;
group 1;
al-extents 257;
  }
  # Begin chassis configuration
  on plat-pc-18 {
device /dev/drbd0;
disk   /dev/mapper/nrStorage-storage;
address10.107.10.38:7789;
meta-disk  internal;
  }
  on plat-pc-17 {
device /dev/drbd0;
disk   /dev/mapper/nrStorage-storage;
address10.107.10.37:7789;
meta-disk  internal;
  }
}
debug output:

===crm_verify -L===
crm_verify[7494]: 2008/02/28_17:15:14 info: main: =#=#=#=#= Getting XML 
=#=#=#=#=
crm_verify[7494]: 2008/02/28_17:15:14 info: main: Reading XML from: live cluster
crm_verify[7494]: 2008/02/28_17:15:14 notice: main: Required feature set: 2.0
crm_verify[7494]: 2008/02/28_17:15:14 debug: cluster_option: Using default 
value 'stop' for cluster option 'no-quorum-policy'
crm_verify[7494]: 2008/02/28_17:15:14 debug: cluster_option: Using default 
value 'false' for cluster option 'stonith-enabled'
crm_verify[7494]: 2008/02/28_17:15:14 debug: cluster_option: Using default 
value 'reboot' for cluster option 'stonith-action'
crm_verify[7494]: 2008/02/28_17:15:14 debug: cluster_option: Using default 
value '0' for cluster option 'default-resource-failure-stickiness'
crm_verify[7494]: 2008/02/28_17:15:14 debug: cluster_option: Using default 
value 'true' for cluster option 'is-managed-efault'
crm_verify[7494]: 2008/02/28_17:15:14 debug: cluster_option: Using default 
value '60s' for cluster option 'cluster-delay'
crm_verify[7494]: 2008/02/28_17:15:14 debug: cluster_option: Using default 
value '30' for cluster option 'batch-limit'
crm_verify[7494]: 2008/02/28_17:15:14 debug: cluster_option: Using default 
value '20s' for cluster option 'default-action-imeout'
crm_verify[7494]: 2008/02/28_17:15:14 debug: cluster_option: Using default 
value 'true' for cluster option 'stop-orphan-resources'
crm_verify[7494]: 2008/02/28_17:15:14 debug: cluster_option: Using default 
value 'true' for cluster option 'stop-orphan-actions'
crm_verify[7494]: 

Re: [Linux-HA] Newest version of 'showscores'

2008-02-28 Thread Dominik Klein

Serge Dubrouski wrote:

On Thu, Feb 28, 2008 at 9:53 AM, Dejan Muhamedagic [EMAIL PROTECTED] wrote:

Hi,


 On Thu, Feb 28, 2008 at 03:47:04PM +0100, Dominik Klein wrote:
  thanky you for the script and for pointing to the right direction.
  May I change the format of the output? (Yes, I also saw the thing with
  wrong headings)
 
  Here's a newer version. It can now read resource-stickiness and
  resource_stickiness (notice the - and _). Both is possible, but up to now,
  only one was looked for.
  Also fixed a problem with the headings being mixed up.

 I was wondering if you could do it the other way around, i.e. one
 gives failover requirements such as after third failure move to
 the other node and the script calculates the various stickiness
 values. How about that?


And I was wondering why it can't be done on CRM level? It would be
great to be able to define a max number of allowed failures and let
CRM to calculate all necessary scores.


Full ACK. That would be nice.

I guess I could script sth like that.

But imho: If you read the ScoreCalculation page on linux-ha.org, 
calculating the necessary values for resource_failure_stickiness should 
be possible for someone who is to administer a cluster.
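
To sketch the arithmetic (numbers invented): say node1 has a location score of 100, node2 has 0, resource_stickiness is 100 and you want the resource to move on the third failure. While it runs on node1, its score there is 100 + 100 + failcount * resource_failure_stickiness. With resource_failure_stickiness = -100 that is 100 after one failure, 0 after two and -100 after three, so the third failure pushes it over to node2. The principle is on the ScoreCalculation page.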


Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Clonesets and Resource Groups

2008-02-27 Thread Dominik Klein

Michael is right.


Wörz that is :)
Didn't see you both had the same first name. Sorry.

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Clonesets and Resource Groups

2008-02-27 Thread Dominik Klein

The biggest issue so far is to migrate constraint resources frome one node
to another with a single command. I cannot use grouped resources bcs one of
the resources must be a cloneset (ocfs) and thus cannot be a member of a
group.

Why not? You just cannot create this in the GUI. Use CLI.


Michael is right.

beekhof 02/21/2007 10:16 PM:
groups cannot contain anything except primitive resources

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] how to create meta-data?

2008-02-27 Thread Dominik Klein

  1.Iam using DRBD-0.7.21 , how to create meta-data for this system and how to 
upgrade it to DRBD-8 meta-data?


http://blogs.linbit.com/florian/2007/10/03/step-by-step-upgrade-from-drbd-07-to-drbd-8/


  2.My meta-disk option in /etc/drbd.conf file has /dev/hda6 as entry.Is this 
same as internal meta-data?


Sounds more like external meta-data. Post your config file.

I don't want to be rude, but did you read any drbd documentation yet? 
Did you set up the system yourself? Do you ever use google?


Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] How to trigger stonith of node

2008-02-27 Thread Dominik Klein

is there a way to trigger the stonith of a node for testing?


pkill -9 heartbeat
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Newest version of 'showscores'

2008-02-27 Thread Dominik Klein

Andreas Mock wrote:

Hi all,

can someone point me to the newest version of 'showscores'.


This should be the newest version.

http://hg.clusterlabs.org/pacemaker/dev/rev/86e1f081dc7f

Apparently it has a mix-up in the headings, but that shouldn't hurt.


Someone posted it here once.


Yup, that was me :)

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] how to create meta-data?

2008-02-26 Thread Dominik Klein
  What does meta-data in DRBD mean 


http://www.drbd.org/users-guide/ch-internals.html#s-metadata


and how to create meta-data.


http://www.drbd.org/users-guide/s-first-time-up.html
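
With DRBD 8, the actual command then boils down to (the resource name r0 is only an example):

  drbdadm create-md r0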

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] drbd heartbeat v2

2008-02-19 Thread Dominik Klein

Damon Estep wrote:

On this page: http://www.linux-ha.org/DRBD/HowTov2 is this comment:
drbd must not be started by init


Well, you do not have to start drbd by init. But it shouldn't harm if 
you do.


This statement is false if you want to use the heartbeat Resource Agent 
drbddisk, but that's not what's described in the article.



I have tried both 0.7.25 and 8.0.10 for DRBD on heartbeat 2.1.3 with
crm=yes and everything set up as outlined on the page.

 


The only reliable combination I can create is drbd 0.7.25, heartbeat
2.1.3, and drbd started by init.


What kind of errors do you get? Post config and logs please.

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] stonith on an apcmaster

2008-02-19 Thread Dominik Klein
The stonith daemons start successfully now, but with a monitor interval 
of 15s one of the two fails fairly quickly.  The apc (9211 masterswitch) 
only allows a single login, and I wonder if the two daemons aren't 
colliding, and one is timing out and giving up. 


Did you try apcmastersnmp?

Don't know whether you have to change that for your particular device as 
well, though.
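
To see which parameters the plugin expects, and to test it outside of heartbeat before wiring it into the CIB, the stonith command line tool can help; roughly:

  stonith -L                    # list the available plugin types
  stonith -t apcmastersnmp -n   # list the parameters this plugin needs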

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] drbd heartbeat v2 working (problem with fs0)

2008-02-19 Thread Dominik Klein

Marco Leone wrote:

Hi,

I'm using drbd 8.2.4 and heartbeat v.2 too on two ubuntu 7.04 server nodes.


I guess you did not completely do that.

I followed this link http://linux-ha.org/DRBD/HowTov2 
 constraints

   rsc_location id=rsc_location_group_1 rsc=group_1
 rule id=prefered_location_group_1 score=100
   expression attribute=#uname 
id=prefered_location_group_1_expr operation=eq value=ub704ha01/

 /rule
   /rsc_location
   rsc_order id=drbd0_before_fs0 from=fs0 action=start 
to=ms-drbd0 to_action=promote/

 /constraints
   /configuration
 /cib


Otherwise you'd have the master colocation constraint:

   <rsc_colocation id="fs0_on_drbd0" to="ms-drbd0" to_role="master" from="fs0" score="infinity"/>


This should do what you want.

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] drbd heartbeat v2

2008-02-19 Thread Dominik Klein

crm_verify[19814]: 2008/02/19_08:46:57 WARN: unpack_rsc_op: Processing
failed op drbd0:1_start_0 on cn2-inverness-co: Error
crm_verify[19814]: 2008/02/19_08:46:57 WARN: unpack_rsc_op:
Compatability handling for failed op drbd0:1_start_0 on cn2-inverness-co
crm_verify[19814]: 2008/02/19_08:46:57 WARN: native_color: Resource
drbd0:1 cannot run anywhere
Warnings found during check: config may not be valid


As you can see here, and as in after.txt - start on the drbd resource 
on cn2-inverness-co (which I assume is the host you powered off) failed. 
 Check the logs on that. You have a timestamp so that should be fairly 
easy.

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] DRBD 8.0 under Debian Etch?

2008-02-07 Thread Dominik Klein
Short question: Does anyone here have DRBD8 running with heartbeat under 
Etch?


Short answer: Yes.

Version 8.0.8, upgrading to 8.0.9 within the next days. I use the OCF RA 
to manage drbd as a Master/Slave Resource.


Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] send_arp cisco Pix v7.2 = arp table not update

2008-01-22 Thread Dominik Klein

   Thanks for your advise, unfortunatelly, that command is not include in the 
PIX :-(( ... I'am still on that point and I must confess that I have no clue at 
all...


You could also modify the RA and have it set a virtual mac address on 
the interface. Be sure to set the original mac address back when the VIP 
leaves that machine, though.
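
As a sketch of what the modified RA's start action would do (the address is made up; most drivers want the link down while the MAC changes):

  ip link set dev eth0 down
  ip link set dev eth0 address 02:00:00:00:00:01
  ip link set dev eth0 up
  # remember the original hardware address and restore it in the stop action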


Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Re: DRBD with monitor Operations won't start - as soon as I delete the operations, it starts immediately

2008-01-04 Thread Dominik Klein

<operations>
  <op id="op-ms-drbd2-1" name="monitor" interval="60s" timeout="60s" start_delay="30s" role="Master"/>
  <op id="op-ms-drbd2-2" name="monitor" interval="60s" timeout="60s" start_delay="30s" role="Slave"/>
</operations>


You cannot have multiple operations with the same interval on a 
resource. Try to change the interval.
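
A sketch of the fix: give the two monitors slightly different intervals and update the op definitions, for example via cibadmin (whether -M accepts single op fragments like this depends on your version, so treat it as an illustration):

  cibadmin -o resources -M -X '<op id="op-ms-drbd2-1" name="monitor" interval="59s" timeout="60s" start_delay="30s" role="Master"/>'
  cibadmin -o resources -M -X '<op id="op-ms-drbd2-2" name="monitor" interval="61s" timeout="60s" start_delay="30s" role="Slave"/>'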


Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] external/ipmi example configuration

2008-01-04 Thread Dominik Klein

How can I test the stonith plugin eg. tell heartbeat to shoot someone?


iptables -I INPUT -j DROP
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] DBRD - split brain - and HA is happily migrating

2008-01-02 Thread Dominik Klein

Thanks for your help. It looks like everything works as desired:

(postgres-02) [~] ifconfig eth1 down
(postgres-02) [~] cat /proc/drbd
version: 8.2.1 (api:86/proto:86-87)
GIT-hash: 318925802fc2638479ad090b73d7af45503dd184 build by [EMAIL PROTECTED], 
2007-12-29 17:37:25
 0: cs:WFConnection st:Secondary/Unknown ds:Outdated/DUnknown C r---
ns:60499852 nr:713732 dw:713732 dr:60499852 al:0 bm:3693 lo:0 pe:0 ua:0 ap:0
resync: used:0/31 hits:3777806 misses:3724 starving:0 dirty:0 
changed:3724
act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0
(postgres-02) [~] ifconfig eth1 up
(postgres-02) [~] cat /proc/drbd
version: 8.2.1 (api:86/proto:86-87)
GIT-hash: 318925802fc2638479ad090b73d7af45503dd184 build by [EMAIL PROTECTED], 
2007-12-29 17:37:25
 0: cs:WFConnection st:Secondary/Unknown ds:Outdated/DUnknown C r---
ns:60499852 nr:713732 dw:713732 dr:60499852 al:0 bm:3693 lo:0 pe:0 ua:0 ap:0
resync: used:0/31 hits:3777806 misses:3724 starving:0 dirty:0 
changed:3724
act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0
(postgres-02) [~] cat /proc/drbd
version: 8.2.1 (api:86/proto:86-87)
GIT-hash: 318925802fc2638479ad090b73d7af45503dd184 build by [EMAIL PROTECTED], 
2007-12-29 17:37:25
 0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---
ns:60499852 nr:715292 dw:715292 dr:60499852 al:0 bm:3705 lo:0 pe:0 ua:0 ap:0
resync: used:0/31 hits:3777942 misses:3736 starving:0 dirty:0 
changed:3736
act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0

When I put the crosslink link down, the disk on postgres-02 gets outdated, if I
put it back on, it syncs in no-time.


You should be aware of one thing though:

If you have a DRBD splitbrain now and your primary crashes whilst in 
splitbrain, heartbeat will never be able to start your resource on the 
secondary node, as the DRBD resource is outdated. Read: Your resource 
will not run at all, not even with old data.


You will have to manually run something like "drbdadm -- 
--overwrite-data-of-peer primary $resource" to get the device into 
primary state on an outdated, disconnected secondary.


When the crashed primary comes back, you need to run "drbdadm -- 
--discard-my-data connect $resource" (on the crashed primary) to get it 
in sync again - heartbeat is not able to do that on its own (which is 
good: it shouldn't know a way to force a device to primary state).


Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] DBRD - split brain - and HA is happily migrating

2008-01-01 Thread Dominik Klein

Thomas Glanzmann wrote:

Hello,
I have drbd (newest version; same goes for heartbeat) running as a
master/slave ressource on the latest heart beat ressource and had the
following problem. I had a split brain situation 


In DRBD or in the entire cluster?


and heartbeat made it
possible to migrate from one node to another and I wonder how that is
possible? How do other people handle this situation. My setup so far is
the following:


You didnt give your drbd.conf, but I suppose you do not use DRBD 
resource fencing. Without resource fencing, it is perfectly possible to 
execute drbdadm primary $resource on a disconnected secondary.


Take a look at the resource fencing function in DRBD. The primary will 
then use another communication path to set the secondary resource to an 
outdated status. drbdadm primary on an outdated resource will not 
succeed unless you manually force it (the DRBD RA does not do that).


Read the man drbd.conf section about it and this link is also worth a 
read: 
http://blogs.linbit.com/florian/2007/10/01/an-underrated-cluster-admins-companion-dopd/


Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] DRBD - ext3 - IP - PostgreSQL Setup

2007-12-30 Thread Dominik Klein

Thomas Glanzmann wrote:

Hello,
I have the following setup: DRBD = ext3 = IPaddr2 = pgsql. I have the
following configured:

00_README:# Ressourcen hinzufügen:
00_README:
00_README:cibadmin -o resources -C -x 01_drbd
00_README:cibadmin -o resources -C -x 02_filesystem
00_README:cibadmin -o constraints -C -x 03_constraint_run_on
00_README:cibadmin -o constraints -C -x 04_order_drbd_before_fs0
00_README:cibadmin -o constraints -C -x 05_colocation_drbd_master_on_fs0
00_README:
00_README:cibadmin -o resources -C -x 06_ip_address
00_README:cibadmin -o resources -C -x 07_pgsql
00_README:
00_README:cibadmin -o constraints -C -x 08_order_fs0_before_pgsql
00_README:cibadmin -o constraints -C -x 09_order_ip0_before_pgsql
00_README:cibadmin -o constraints -C -x 10_colocation_pgsql_ip0
00_README:cibadmin -o constraints -C -x 11_colocation_pgsql_fs0
00_README:cibadmin -o constraints -C -x 12_colocation_fs0_ip0
00_README:
00_README:# DRBD / FS starten:
00_README:
00_README:crm_resource -r ms-drbd0 -v '#default' --meta -p target_role
00_README:crm_resource -r fs0 -v '#default' --meta -p target_role
00_README:crm_resource -r pgsql0 -v '#default' --meta -p target_role
00_README:
00_README:00_README
00_README:01_drbd
00_README:02_filesystem
00_README:03_constraint_run_on
00_README:04_order_drbd_before_fs0
00_README:05_colocation_drbd_master_on_fs0
00_README:06_ip_address
00_README:07_pgsql
00_README:08_order_fs0_before_pgsql
00_README:09_order_ip0_before_pgsql
00_README:10_colocation_pgsql_ip0
00_README:11_colocation_pgsql_fs0
00_README:12_colocation_fs0_ip0
01_drbd:   <master_slave id="ms-drbd0">
01_drbd:     <meta_attributes id="ma-ms-drbd0">
01_drbd:       <attributes>
01_drbd:         <nvpair id="ma-ms-drbd0-1" name="clone_max" value="2"/>
01_drbd:         <nvpair id="ma-ms-drbd0-2" name="clone_node_max" value="1"/>
01_drbd:         <nvpair id="ma-ms-drbd0-3" name="master_max" value="1"/>
01_drbd:         <nvpair id="ma-ms-drbd0-4" name="master_node_max" value="1"/>
01_drbd:         <nvpair id="ma-ms-drbd0-5" name="notify" value="yes"/>
01_drbd:         <nvpair id="ma-ms-drbd0-6" name="globally_unique" value="false"/>
01_drbd:         <nvpair id="ma-ms-drbd0-7" name="target_role" value="stopped"/>
01_drbd:       </attributes>
01_drbd:     </meta_attributes>
01_drbd:     <primitive id="drbd0" class="ocf" provider="heartbeat" type="drbd">
01_drbd:       <instance_attributes id="ia-drbd0">
01_drbd:         <attributes>
01_drbd:           <nvpair id="ia-drbd0-1" name="drbd_resource" value="postgres"/>
01_drbd:         </attributes>
01_drbd:       </instance_attributes>
01_drbd:     </primitive>
01_drbd:   </master_slave>
02_filesystem:<primitive class="ocf" provider="heartbeat" type="Filesystem" id="fs0">
02_filesystem:  <meta_attributes id="ma-fs0">
02_filesystem:    <attributes>
02_filesystem:      <nvpair name="target_role" id="ma-fs0-1" value="stopped"/>
02_filesystem:    </attributes>
02_filesystem:  </meta_attributes>
02_filesystem:
02_filesystem:  <instance_attributes id="ia-fs0">
02_filesystem:    <attributes>
02_filesystem:      <nvpair id="ia-fs0-1" name="fstype" value="ext3"/>
02_filesystem:      <nvpair id="ia-fs0-2" name="directory" value="/srv/postgres"/>
02_filesystem:      <nvpair id="ia-fs0-3" name="device" value="/dev/drbd0"/>
02_filesystem:    </attributes>
02_filesystem:  </instance_attributes>
02_filesystem:</primitive>
03_constraint_run_on:<rsc_location id="drbd0-placement-1" rsc="ms-drbd0">
03_constraint_run_on:  <rule id="drbd0-rule-1" score="-INFINITY">
03_constraint_run_on:    <expression id="exp-01" value="postgres-01" attribute="#uname" operation="ne"/>
03_constraint_run_on:    <expression id="exp-02" value="postgres-02" attribute="#uname" operation="ne"/>
03_constraint_run_on:  </rule>
03_constraint_run_on:</rsc_location>
04_order_drbd_before_fs0:<rsc_order id="drbd0_before_fs0" from="fs0" action="start" to="ms-drbd0" to_action="promote"/>
05_colocation_drbd_master_on_fs0:<rsc_colocation id="fs0_on_drbd0" to="ms-drbd0" to_role="master" from="fs0" score="infinity"/>
06_ip_address:<primitive class="ocf" provider="heartbeat" type="IPaddr2" id="ip0">
06_ip_address:  <meta_attributes id="ma-ip0">
06_ip_address:    <attributes>
06_ip_address:      <nvpair name="target_role" id="ma-ip0-1" value="stopped"/>
06_ip_address:    </attributes>
06_ip_address:  </meta_attributes>
06_ip_address:
06_ip_address:  <instance_attributes id="ia-ip0">
06_ip_address:    <attributes>
06_ip_address:      <nvpair id="ia-ip0-1" name="ip" value="172.17.0.20"/>
06_ip_address:      <nvpair id="ia-ip0-2" name="cidr_netmask" value="24"/>
06_ip_address:      <nvpair id="ia-ip0-3" name="nic" value="eth0.2"/>
06_ip_address:    </attributes>
06_ip_address:  </instance_attributes>
06_ip_address:</primitive>
07_pgsql:<primitive class="ocf" provider="heartbeat" type="pgsql" id="pgsql0">
07_pgsql:

Re: [Linux-HA] DRBD Config

2007-12-21 Thread Dominik Klein
we are using a two node cluster master/slave with an openSuSE 10.3, 
heartbeat 2.0.7 and drbd 8.0.6.


I tried the configuration from this webpage:
http://www.linux-ha.org/DRBD/HowTov2


This should only be done with a recent version of heartbeat/crm. There 
have been major improvements on multistate resources since 2.0.7. I 
suggest trying the 2.1.3 testing version from 
http://hg.linux-ha.org/test/archive/tip.tar.gz or at least the latest 
interim build from 
http://download.opensuse.org/repositories/server:/ha-clustering/openSUSE_10.3/


Especially the testing version works like a charm with DRBD8 here, 
although the interim build should be fine as well.


Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] DRBD Config

2007-12-21 Thread Dominik Klein

2) DRBD8 is NOT supported from heartbeat. Please use DRBD0.7


I know the howto states so, but did you try it? Works for me ...

Imho, the docs are outdated about that.

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] DRBD Config

2007-12-21 Thread Dominik Klein
Dec 20 12:57:49 mylogin1 drbd[7119]: [7131]: DEBUG: : Calling 
/sbin/drbdadm -c /etc/drbd.conf state

Dec 20 12:57:49 mylogin1 drbd[7119]: [7134]: DEBUG: : Exit code 0


can you c/p what you get when you issue
   /sbin/drbdadm -c /etc/drbd.conf state
by hand?


That's a syntax error. The resource is missing (as stated in the first 
email).
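
If it is the resource argument that is missing from the call, the working 
invocation would look something like this (the resource name r0 is just an 
example):

/sbin/drbdadm -c /etc/drbd.conf state r0
/sbin/drbdadm -c /etc/drbd.conf state all

which should print something like Primary/Secondary instead of a usage error.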


Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] DRBD Config

2007-12-21 Thread Dominik Klein
Please see my thread from Oct 18th on this list and esp the answer from Andrew 
from Oct 22nd.


I read that thread. There also was someone else who stated it's working 
for him.



Can you confirm that It works? For LARGE partitions? Would be good news!


The largest partition I manage with DRBD v8 in heartbeat has 130 GB. 
Works as expected. If you look at the commands the RA issues - why 
should it not?


Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Relevance of STONITH with Xen over DRBD setup

2007-12-14 Thread Dominik Klein

I just want to confirm this. From what i've learned so far, STONITH is
relevant only to avoid data corruption when using shared storage. So, is
STONITH relevant when i'm using a non-shared setup with Heartbeat and XEN VM
on top of DRBD? Xen is using file images created on top of a ext3 FS on top
of DRBD, and there should not be any concurrent access.


Here you say yourself that you need STONITH :)


You might as well be fine with DRBD resource fencing.

As long as communication paths are redundant, you *should* not end up 
with a domU running on both dom0s. You still might, if DRBD resource 
fencing does not work properly.


To be really sure, I'd rather have a stonith device.

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Ordering Resource Groups V2

2007-12-13 Thread Dominik Klein

Damon Estep wrote:

I have created an order constraint that requires a DRBD/iSCSI target
resource group to be up before an application resource groups comes up.

 


At startup the order is honored, and the resource groups come up in the
desired order.

 


In the event of a failover in the storage group I would like the
application groups to go offline while the storage group fails over to
another node, otherwise the applications will crash because they have
lost access to the storage.

 


The application resource groups do not stop while the storage group
recovers.

 


If two groups are created, call them resource1 and resource2, and then
an order constraint is created where resource1 before resource2, should
resource2 go offline during a failover of resource1? In my test setup
they do not.


Why don't you just use one group? That should give you the intended 
behaviour.


Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Re: A question about the combined score

2007-12-12 Thread Dominik Klein
 If the error occurs on the resource, then resource-failure-stickiness
 will come to play and make your scores:
 Node1: 10 - 10 = 0
 Node2: 9
 As 9 > 0, the resource will be started on Node2, and 22 stickiness will
 be added. So you have 31 > 0.
 
 In your comments, you remarked a failover should occur if the following
 condition met.
 
 Node1_preference + Node1_failcount * failure_stickiness
  < Node2_preference + Node2_failcount * failure_stickiness
 
 Is my understanding correct?

Sounds good to me.
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] default_resource_stickiness=INFINITY and default_resource_failure_stickiness=-INFINITY

2007-12-10 Thread Dominik Klein

I have created simple 2-node cluster with 4 drbd multi-state resources
and xen DomU on it with enabled stonith and setting:


Why don't you use drbd natively in xen?

In your drbd installation, you should find a script named block-drbd. 
Copy that to /etc/xen/scripts and configure your domU like:


disk = [ 'drbd:drbd1,hda1,w' ]

That way, you only need the xen resource in heartbeat and don't have to 
worry about master/slave resources at all.
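
A rough sketch of the remaining heartbeat side (the xen_tr2 name is taken 
from your constraints below, the xmfile path is made up, adjust to your 
setup):

<primitive id="xen_tr2" class="ocf" provider="heartbeat" type="Xen">
  <instance_attributes id="ia-xen_tr2">
    <attributes>
      <nvpair id="ia-xen_tr2-1" name="xmfile" value="/etc/xen/tr2"/>
    </attributes>
  </instance_attributes>
</primitive>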


Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] default_resource_stickiness=INFINITY and default_resource_failure_stickiness=-INFINITY

2007-12-10 Thread Dominik Klein

And to answer your question:


I have created simple 2-node cluster with 4 drbd multi-state resources
and xen DomU on it with enabled stonith and setting:
default_resource_stickiness=INFINITY and
default_resource_failure_stickiness=-INFINITY.
The idea for cluster working is:
1. promote all drdd resources on node sles236 and start drbd resources
on node sles238. (works OK)
2. when failure of drbd occurs - promote all drbd resources on
sles238 and reboot sles236.
3. when sles236 join back to the cluster after reboot, leave drbd
promoted on sles238.
I thought that setting: default_resource_stickiness=INFINITY and
default_resource_failure_stickiness=-INFINITY guarantee this bahaviour
but in fact I have:
- when sles236 join back to the cluster after reboot, all drbd
resources are demoted on sles238 and promoted on sles36. Where is
mistake in my cib.xml?


...


 <constraints>
   <rsc_location id="pref_location_drbd0" rsc="ms-drbd0">
     <rule id="sles236_location_drbd0" score="100" boolean_op="and" role="Master">
       <expression attribute="#uname" id="drbd0_on_sles236" operation="eq" value="sles236"/>
     </rule>
   </rsc_location>
   <rsc_location id="pref_location_drbd1" rsc="ms-drbd1">
     <rule id="sles236_location_drbd1" score="100" boolean_op="and" role="Master">
       <expression attribute="#uname" id="drbd1_on_sles236" operation="eq" value="sles236"/>
     </rule>
   </rsc_location>
   <rsc_location id="pref_location_drbd2" rsc="ms-drbd2">
     <rule id="sles236_location_drbd2" score="100" boolean_op="and" role="Master">
       <expression attribute="#uname" id="drbd2_on_sles236" operation="eq" value="sles236"/>
     </rule>
   </rsc_location>
   <rsc_location id="pref_location_drbd3" rsc="ms-drbd3">
     <rule id="sles236_location_drbd3" score="100" boolean_op="and" role="Master">
       <expression attribute="#uname" id="drbd3_on_sles236" operation="eq" value="sles236"/>
     </rule>


I think this is the cause. You prefer to run drbd on sles236, so when that 
node comes back, the cluster moves the masters back to it (one way to drop 
that preference is sketched after the quoted config).


   </rsc_location>
   <rsc_order id="drbd0_before_tr2_xen" from="xen_tr2" action="start" to="ms-drbd0" to_action="promote"/>
   <rsc_order id="drbd1_before_tr2_xen" from="xen_tr2" action="start" to="ms-drbd1" to_action="promote"/>
   <rsc_order id="drbd2_before_tr2_xen" from="xen_tr2" action="start" to="ms-drbd2" to_action="promote"/>
   <rsc_order id="drbd3_before_tr2_xen" from="xen_tr2" action="start" to="ms-drbd3" to_action="promote"/>
   <rsc_colocation id="col_xen_drbd0_master" from="xen_tr2" from_role="Started" to="ms-drbd0" to_role="Master" score="INFINITY"/>
   <rsc_colocation id="col_xen_drbd1_master" from="xen_tr2" from_role="Started" to="ms-drbd1" to_role="Master" score="INFINITY"/>
   <rsc_colocation id="col_xen_drbd2_master" from="xen_tr2" from_role="Started" to="ms-drbd2" to_role="Master" score="INFINITY"/>
   <rsc_colocation id="col_xen_drbd3_master" from="xen_tr2" from_role="Started" to="ms-drbd3" to_role="Master" score="INFINITY"/>
 </constraints>
   </configuration>
 </cib>
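
If that failback is not wanted, removing (or lowering) these location 
preferences should do it. A rough sketch, reusing one of the ids from the 
quoted config (repeat for the drbd1-3 constraints and double-check the 
syntax against your cibadmin version):

cibadmin -D -o constraints -X '<rsc_location id="pref_location_drbd0"/>'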


Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Pingd

2007-12-06 Thread Dominik Klein

China wrote:

Ok, now it works, but when the PC_A returns up the resource doesn't remains
on PC_B and failback to PC_A.
How can I configure to switch the first time to PC_B on PC_A failover, but
not return back if PC_A returns UP?


Set resource stickiness to a reasonable value.

Here's roughly how stickiness works:

When the cluster starts up, the scores are calculated from your constraints and a 
decision is made where to run the resource (PC_A in your case). 
Afterwards, the stickiness is added to the score for PC_A because the 
resource is running there.


Now PC_A fails. The node with the highest score for your resource gets 
to run it (PC_B). After a successful start, the stickiness is added to 
the score for PC_B.


Now PC_A comes back. It will get its normal score from your constraints, 
but no stickiness because the resource is not running there.


So if you make (PC_B score + stickiness) > (PC_A score), the resource 
will stay on PC_B.
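
A made-up example: if PC_A has a location score of 100 and PC_B of 0, a 
stickiness of 200 leaves PC_B at 0 + 200 = 200 against PC_A's 100 after the 
failover, so the resource stays where it is. Setting the value could look 
like this (the property is spelled default_resource_stickiness or 
default-resource-stickiness depending on the release):

crm_attribute -t crm_config -n default_resource_stickiness -v 200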


Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Possible bug in Score calculation?

2007-12-06 Thread Dominik Klein

Hi

sorry I have to bother again about score calculation but I came across 
something I don't understand and that might be a bug.


I have a master-slave drbd resource called ms-drbd (primitive is called 
drbd2) and a group named testdb (4 primitives, mount being the first 
primitive).

pingd multiplier is 300 (5 ping nodes)
default-resource-stickiness is 200

And these are my constraints:
<constraints>
  <rsc_order id="drbd2_before_testdb" from="testdb" action="start" to="ms-drbd2" to_action="promote"/>
  <rsc_colocation id="testdb_on_drbd2" to="ms-drbd2" to_role="master" from="testdb" score="300"/>
</constraints>

When I look at the scores, I see

drbd2:0 node1 0
Okay.

drbd2:0 node2 676
Weird
This looks like the crm_master value from the drbd OCF RA plus 
stickiness times (number_of_group_items - 1).


Why do I think this has to do with the group? Because if I add 
another item to the group, the score increases by 1 x stickiness.


Guess this shouldn't be, should it?

drbd2:1 node1 76
crm_master value from drbd OCF RA

drbd2:1 node2 -infinity
Okay.

mount node1 0
Okay.

mount node2 1100
300 (constraint)
+ 4 x 300 (pingd)

Regards
Dominik

Here's the cib.xml

   <configuration>
     <crm_config>
       <cluster_property_set id="cib-bootstrap-options">
         <attributes>
           <nvpair name="no-quorum-policy" value="stop" id="no-quorum-policy"/>
           <nvpair name="symmetric-cluster" value="true" id="symmetric-cluster"/>
           <nvpair name="stonith-enabled" value="false" id="stonith-enabled"/>
           <nvpair name="stonith-action" value="reboot" id="stonith-action"/>
           <nvpair name="default-resource-stickiness" value="200" id="default-resource-stickiness"/>
           <nvpair name="default-resource-failure-stickiness" value="-100" id="default-resource-failure-stickiness"/>
           <nvpair name="is-managed-default" value="true" id="is-managed-default"/>
           <nvpair name="default-action-timeout" value="20s" id="default-action-timeout"/>
           <nvpair name="stop-orphan-resources" value="true" id="stop-orphan-resources"/>
           <nvpair name="stop-orphan-actions" value="true" id="stop-orphan-actions"/>
           <nvpair name="remove-after-stop" value="false" id="remove-after-stop"/>
           <nvpair name="pe-error-series-max" value="-1" id="pe-error-series-max"/>
           <nvpair name="pe-warn-series-max" value="-1" id="pe-warn-series-max"/>
           <nvpair name="pe-input-series-max" value="-1" id="pe-input-series-max"/>
           <nvpair name="startup-fencing" value="true" id="startup-fencing"/>
         </attributes>
       </cluster_property_set>
     </crm_config>
     <resources>
       <master_slave id="ms-drbd2">
         <meta_attributes id="ma-ms-drbd2">
           <attributes>
             <nvpair id="ma-ms-drbd2-1" name="clone_max" value="2"/>
             <nvpair id="ma-ms-drbd2-2" name="clone_node_max" value="1"/>
             <nvpair id="ma-ms-drbd2-3" name="master_max" value="1"/>
             <nvpair id="ma-ms-drbd2-4" name="master_node_max" value="1"/>
             <nvpair id="ma-ms-drbd2-5" name="notify" value="yes"/>
             <nvpair id="ma-ms-drbd2-6" name="globally_unique" value="false"/>
             <nvpair id="ma-ms-drbd2-7" name="target_role" value="started"/>
           </attributes>
         </meta_attributes>
         <primitive id="drbd2" class="ocf" provider="heartbeat" type="drbd">
           <instance_attributes id="ia-drbd2">
             <attributes>
               <nvpair id="ia-drbd2-1" name="drbd_resource" value="drbd2"/>
             </attributes>
           </instance_attributes>
           <operations>
             <op id="op-ms-drbd2-1" name="monitor" interval="5s" timeout="5s" start_delay="30s" role="Master"/>
             <op id="op-ms-drbd2-2" name="monitor" interval="6s" timeout="5s" start_delay="30s" role="Slave"/>
           </operations>
         </primitive>
       </master_slave>
       <group id="testdb">
         <meta_attributes id="ma-testdb">
           <attributes>
             <nvpair id="ma-testdb-1" name="target_role" value="started"/>
           </attributes>
         </meta_attributes>
         <primitive id="mount" class="ocf" type="filesystem" provider="heartbeat">
           ...
         </primitive>
         <primitive id="postgres" class="ocf" type="pgsql" provider="heartbeat">
           ...
         </primitive>
         <primitive id="masterip" class="ocf" type="IPaddr2" provider="heartbeat">
           ...
         </primitive>
         <primitive id="slon" class="ocf" type="dkcustom" provider="dk">
           ...
         </primitive>
       </group>
     </resources>
     <constraints>
       <rsc_order id="drbd2_before_testdb" from="testdb" action="start" to="ms-drbd2" to_action="promote"/>

Re: [Linux-HA] Pingd

2007-12-06 Thread Dominik Klein

With this configuration the resources doesn't failover to test, but remains
on test-ppc. Why?


I can't say. The configuration looks good to me.

But again:

What are you doing to force the failure?

Do you really have just one connection between the nodes and unplug that 
connection to force the failure?


This is bound to cause problems. And you don't even have a STONITH 
configuration, which makes it a little worse. Still, only one connection 
between the nodes is a problem.


Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Pingd

2007-12-06 Thread Dominik Klein

China wrote:

Sorry, I forgot it!

I've two connection for the PCs:

one with crossover cable, where heartbeat send packets directly to other PC
one through network, where the services listen and where pingd test
connectivity

When I force the failure I disconnect the network cable that give services
from PC_A.


Use both connections for heartbeat. Maybe use ucast on the external 
interface instead of bcast.
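
In ha.cf that could look roughly like this (interface name and peer address 
are placeholders; each node puts the other node's address in its ucast line):

bcast eth2
ucast ethX <address of the other node on the service network>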


Your supplied config file only reads one interface:
--
*ha.cf:

use_logd yes
compression zlib
coredumps no
keepalive 1
warntime 2
deadtime 5
deadping 5
udpport 694
bcast eth2
node test-ppc test3
ping 192.168.122.113
#respawn hacluster /rnd/apps/components/heartbeat/lib/heartbeat/ipfail
#respawn root /rnd/apps/components/heartbeat/lib/heartbeat/pingd -m 100 -d 5s
#auto_failback off
crm yes
-

But if it really is set up the way you describe, this should not cause the 
problem you reported.


Just a thought: If - by any chance - you pull the plug out of the 
connection that sends/receives the heartbeats, you will have a 
splitbrain situation which would nicely explain the things you mentioned.


So please use both links for heartbeat cluster communication and make 
sure you pull the right plug :)


Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Pingd

2007-12-06 Thread Dominik Klein

But, It's good to use a interface both for heartbeat and for services?


It's pretty common I think.
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Pingd

2007-12-06 Thread Dominik Klein

China wrote:

Last question: how can I see what is the node's score during cluster
execution?


You can grep it out of the ptest output.

Or use my script:
http://lists.community.tummy.com/pipermail/linux-ha/2007-September/027488.html

which has been updated by Robert Lindgren:
http://lists.community.tummy.com/pipermail/linux-ha/2007-September/027745.html
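
If you just want a quick look without the script, the idea behind it is 
roughly this (option letters and the exact output format differ between 
heartbeat releases, so check ptest's help output before relying on it):

cibadmin -Q > /tmp/cib.xml
ptest -x /tmp/cib.xml -VVVVVV 2>&1 | grep -i score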

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Possible bug in Score calculation?

2007-12-06 Thread Dominik Klein

Good morning Andrew

sorry I have to bother again about score calculation but I came across 
something I don't understand and that might be a bug.


I have a master-slave drbd resource called ms-drbd (primitive is 
called drbd2) and a group named testdb (4 primitives, mount being 
the first primitive).

pingd multiplier is 300 (5 ping nodes)
default-resource-stickiness is 200

And these are my constraints:
<constraints>
   <rsc_order id="drbd2_before_testdb" from="testdb" action="start" to="ms-drbd2" to_action="promote"/>
   <rsc_colocation id="testdb_on_drbd2" to="ms-drbd2" to_role="master" from="testdb" score="300"/>
</constraints>

When I look at the scores, I see

drbd2:0 node1 0
Okay.

drbd2:0 node2 676
Weird
This looks like the crm_master value from the drbd OCF RA plus 
stickiness times (number_of_group_items - 1).


that would make sense.

Why do I think this has to do with the group? Because if I add 
another item to the group, the score increases by 1 x stickiness.


Guess this shouldn't be, should it?



it should.

the group needs to go with the master... so it makes sense that you 
should take the groups location preferences into account when deciding 
where to place and promote drbd


Full ack on that, but I didn't configure it (at least not that I knew of 
- yet). So it's known and intended to work this way. Sweet!


Just curious: I suppose it's my first constraint that does this job?

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Possible bug in Score calculation?

2007-12-06 Thread Dominik Klein

Just curious: I suppose it's my first constraint that does this job?


second - the colocation one


Okay thanks - so sure I even got the 50/50 wrong :p

Then I must ask another question: Why does this not apply to colocated 
primitives?


I just tested with a single primitive (testdb) colocated to the master 
role of ms-drbd2:


drbd2:0 node1   0
drbd2:0 node2   76
drbd2:1 node1   76
drbd2:1 node2   -infin
testdb node1   0
testdb node2   500

<resources>
  <master_slave id="ms-drbd2">
    <meta_attributes id="ma-ms-drbd2">
      <attributes>
        <nvpair id="ma-ms-drbd2-1" name="clone_max" value="2"/>
        <nvpair id="ma-ms-drbd2-2" name="clone_node_max" value="1"/>
        <nvpair id="ma-ms-drbd2-3" name="master_max" value="1"/>
        <nvpair id="ma-ms-drbd2-4" name="master_node_max" value="1"/>
        <nvpair id="ma-ms-drbd2-5" name="notify" value="yes"/>
        <nvpair id="ma-ms-drbd2-6" name="globally_unique" value="false"/>
        <nvpair id="ma-ms-drbd2-7" name="target_role" value="started"/>
      </attributes>
    </meta_attributes>
    <primitive id="drbd2" class="ocf" provider="heartbeat" type="drbd">
      <instance_attributes id="ia-drbd2">
        <attributes>
          <nvpair id="ia-drbd2-1" name="drbd_resource" value="drbd2"/>
        </attributes>
      </instance_attributes>
      <operations>
        <op id="op-ms-drbd2-1" name="monitor" interval="5s" timeout="5s" start_delay="30s" role="Master"/>
        <op id="op-ms-drbd2-2" name="monitor" interval="6s" timeout="5s" start_delay="30s" role="Slave"/>
      </operations>
    </primitive>
  </master_slave>
  <primitive id="testdb" class="ocf" type="filesystem" provider="heartbeat">
    <instance_attributes id="ia-mount">
      <attributes>
        <nvpair id="ia-mount-1" name="device" value="/dev/drbd2"/>
        <nvpair id="ia-mount-2" name="directory" value="/packages/postgres/data/"/>
        <nvpair id="ia-mount-3" name="fstype" value="ext3"/>
      </attributes>
    </instance_attributes>
    <operations>
      <op id="op-mount-1" name="monitor" interval="5s" timeout="5s" start_delay="30s" role="Started"/>
    </operations>
  </primitive>
</resources>
<constraints>
  <rsc_order id="drbd2_before_testdb" from="testdb" action="start" to="ms-drbd2" to_action="promote"/>
  <rsc_colocation id="testdb_on_drbd2" to="ms-drbd2" to_role="master" from="testdb" score="300"/>
</constraints>
</configuration>
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] 1000 extra score for a group?

2007-11-15 Thread Dominik Klein
I defined a rsc_location constraint with a rule with score 100 for a 
particular node for my HA-group.

Resource stickiness is 200.
Furthermore I use the a rule with score_attribute=pingd 
(multiplier=100) for the group.
With 5 available ping nodes this should make 100+200+500=800. But the 
score I see is 1800.


So somehow 1000 were added here.


how many resources in the group?  6 by any chance?

that would make the stickiness 6*200 = 1200 (instead of 200) and explain 
the extra 1000.


You hit the nail on the head. Thanks for explaining this.
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] RFC: Change on OCF RA Filesystem's monitor action

2007-11-09 Thread Dominik Klein

Grepping by device doesn't work for mount-by-label at least, and
requires a lot of escaping for networked mounts; so we thought grepping
for the mountpoint was exactly the approach we needed to take.


Never used mount-by-label. Good point though.


I've got to admit I've never had someone use a symlink as a mountpoint.
;-) 


/usr/local/postgres/data is my mountpoint.
/usr/local/postgres points to /usr/local/postgres-version

I think this is not too uncommon.

On the other hand I could just symlink /usr/local/postgres/data to 
/usr/local/data and let that be my mountpoint. Or change my postgres 
config to use another data dir ...



Maybe the right answer would be for the RA to dereference the
symlink then?


Well, one would have to check each directory in the path separately, I 
think. At least I don't know a better way to find out whether 
there is a symlink in the path or not.


I think this would be a lot of work for not much benefit. And 
probably a slower monitor action.


Imho: Let's forget about my suggestion.

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] stdout and stderr redirection in a resource agent

2007-11-08 Thread Dominik Klein

nohup $binfile $cmdline_options > $logfile 2> $errorlogfile &
                                ^          ^^               ^


That binary is invoked from the RA, right?


Sure.

So neither stdout nor stderr should be seen, so Linux-HA should not log 
it. I'm just curious why it does.


Well, despite your redirections, something shows up on stdout (or
stderr). If it does, then it's logged. I really can't say why it
does show up though.


I hate to admit it, but it was an error in my RA. What else ...

The RA supports configuring $logfile and $errlogfile.
If both are set, then the syntax above is used and actually works.

If $errlogfile is not set, then both stdout and stderr go to $logfile. 
But apparently there was a syntax error and that was the reason for 
stderr showing in the logs.
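
For the record, the corrected branching presumably looks something like 
this (simplified, variable names as in the nohup line above):

if [ -n "$errorlogfile" ]; then
        # separate files for stdout and stderr
        nohup $binfile $cmdline_options > "$logfile" 2> "$errorlogfile" &
else
        # no error log configured: send both streams to $logfile
        nohup $binfile $cmdline_options > "$logfile" 2>&1 &
fi

With the 2>&1 in place there is nothing left on stderr for lrmd to pick up 
and log.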


Thanks though.

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] RFC: Change on OCF RA Filesystem's monitor action

2007-11-07 Thread Dominik Klein

Hi

I would like to suggest a change at the Filesystem RA.

The monitor action actually does something like grep $MOUNTPOINT 
/etc/mtab. This does not work if you use a symbolic link as a mountpoint.


If it instead grepped for $DEVICE (maybe with -w to avoid problems 
with 10+ partitions on one disc), this would still find out if the 
filesystem is mounted and solve that problem.
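
A sketch of the suggested check (just the idea, not the actual RA code; the 
usual OCF return code variables are assumed to be available):

if grep -qw "$DEVICE" /etc/mtab; then
        # the device shows up in the mount table
        rc=$OCF_SUCCESS
else
        rc=$OCF_NOT_RUNNING
fi

The -w keeps /dev/sda1 from also matching /dev/sda10 and friends.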


What do you think?

Regards
Dominik

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] stdout and stderr redirection in a resource agent

2007-11-07 Thread Dominik Klein

Hi ... again ...

I wrote my own little RA to start a custom binary. Very basic RA up to now.

I start my binary with
nohup $binfile $cmdline_options > $logfile 2> $errorlogfile &

Works ok actually, the logfiles are filled as expected, but I also see 
some of the output in the Linux-HA log (/var/log/messages).


---
lrmd: [2618]: info: RA output: 
---

How to avoid this?

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] stdout and stderr redirection in a resource agent

2007-11-07 Thread Dominik Klein

Dejan Muhamedagic wrote:

Hi,

On Wed, Nov 07, 2007 at 04:17:45PM +0100, Dominik Klein wrote:

Hi ... again ...

I wrote my own little RA to start a custom binary. Very basic RA up to now.

I start my binary with
nohup $binfile $cmdline_options > $logfile 2> $errorlogfile &

Works ok actually, the logfiles are filled as expected, but I also see 
some of the output in the Linux-HA log (/var/log/messages).


---
lrmd: [2618]: info: RA output: 
---

How to avoid this?


Currently there is no way around that. We assume that if the RA
says something, perhaps it's important, so it is logged. Why
would you want to avoid it?


I start my binary with

nohup $binfile $cmdline_options > $logfile 2> $errorlogfile &
                                ^          ^^               ^

So neither stdout nor stderr should be seen, so Linux-HA should not log 
it. I'm just curious why it does.


Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Feedback: Master/Slave RA for Postgres / Slony Cluster?

2007-11-06 Thread Dominik Klein

Hi

a week earlier I asked whether there was a resource agent that implements 
Master/Slave for a Postgres Cluster using slony-1 replication.


There was not, so I tried to implement it myself.

I want to report back to give an explanation and reference on why I 
think it is not possible (at the moment) to implement this in heartbeat.


Here we go:

Short summary of slony-1 replication:
In a slony-1 replication setup
* Tables are put together to replication sets
* Each set has an origin (master)
* Only the origin can be written to
* There can be multiple sets with a different origin each
* There can be multiple subscribers (slaves) for each set
* Subscribers are read-only

As you have to somehow tie the master role to the health of 
postgres itself, this restricts you to using only one set or managing 
all sets at once. Well, okay, I think I could live with this.


Slony-1 implements two commands for switchover and failover. I mean 
Switchover when I want to do a planned switch of roles when all machines 
are healthy. I mean failover when the Master has a problem and the Slave 
takes over.


So now comes the tricky part.
In slony-1 you cannot make an origin a subscriber without making another 
subscriber the new origin. This happens in ONE command. So there are no 
independent demote and promote commands. In a two machine setup you 
cannot have two slaves at a time.


In other words: Promote implicitly demotes the other machine, 
Demote implicitly promotes the other machine.


So I thought I could implement demote as return 0, as promote on 
the other machine will do the job anyway. Well, not the best idea as a 
monitor action on the apparently demoted machine will still return 
Master Status until promote on the second machine finished.


Furthermore, the switchover command will fail if the other machine is 
not responding. In case the current master really has a problem, all you 
can do to get a writable database on the current slave is to use the 
failover command. But Linux-HA only knows promote and demote.


So I implemented some promote and demote the following way:

 promote
if switchover_to_me
then
return 0
else
if ! switchover_to_me
then
failover_to_me
return $?
fi
fi


 demote
switchover_to_other_machine
# dont care if this works as it cannot work if
# the other machine is not healthy
return 0


What you also need to know about slony-1 is the fact that you need to 
resync the COMPLETE data after a failover. In slony-1 it is not possible 
to let a failed node rejoin the slony-Cluster (even if it was healthy 
when the failover command was issued). It has to fetch ALL data from the 
new master. So you want to avoid failover if it is not absolutely necessary.


Up to now I thought my RA could handle a few cases, and it turns out 
SOME it can handle (like a master reboot, a slave reboot or a controlled 
switchover). But something as simple as killing postgres on the master 
machine causes a failover. Why?:


Say A is master, B is slave at this moment

1. monitor on A fails
2. Linux-HA executes demote on A
- As you see above, this will work even if it does nothing
3. Linux-HA executes promote on B
- This, as postgres on A is not running, will end up in a failover (see 
above)


This is pretty much it. If you have any ideas on how to improve this or 
if you also think that this is impossible with the current master/slave 
implementation in Linux-HA - please respond.


The whole approach of separate demote and promote actions in Linux-HA 
just does not seem to fit the way slony-1 handles switchover and failover.


If you have any more questions (it can well be I forgot something), just 
ask - I'll be happy to help improve Linux-HA.


Best regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Feedback: Master/Slave RA for Postgres / Slony Cluster?

2007-11-06 Thread Dominik Klein

Hi Andrew

thanks for your reply.

So I thought I could implement demote as return 0, as promote on 
the other machine will do the job anyway. Well, not the best idea as a 
monitor action on the apparently demoted machine will still return 
Master Status until promote on the second machine finished.


What if the crm delayed the slave's monitor until after the other side 
was promoted... would that help significantly?


That would probably prevent one failed monitor action in this very 
special case.


Furthermore, the switchover command will fail if the other machine is 
not responding. In case the current master really has a problem, all 
you can do to get a writable database on the current slave is to use the 
failover command. But Linux-HA only knows promote and demote.


So I implemented some promote and demote the following way:

 promote
if switchover_to_me
then
return 0
else
if ! switchover_to_me
then
failover_to_me
return $?
fi
fi


 demote
switchover_to_other_machine
# dont care if this works as it cannot work if
# the other machine is not healthy
return 0


What you also need to know about slony-1 is the fact that you need to 
resync the COMPLETE data after a failover. In slony-1 it is not 
possible to let a failed node rejoin the slony-Cluster (even if it was 
healthy when the failover command was issued). It has to fetch ALL 
data from the new master. So you want to avoid failover if it is not 
absolutely necessary.


Up to now I thought my RA could handle a few cases and it turns out: 
SOME it can handle (like master reboot or slave reboot or controlled 
switchover). But something as simple as killing postgres on the master 
machine causes a failover. Why?:


Say A is master, B is slave at this moment

1. monitor on A fails
2. Linux-HA executes demote on A
- As you see above, this will work even if it does nothing
3. Linux-HA executes promote on B
- This, as postgres on A is not running, will end up in a failover 
(see above)


Notifications might help.
The Filesystem agent (when operating in OCFS2 mode) keeps a list of who 
its peers are.
If you did the same then I think you'd be able to recognize that you're 
all alone and that it was ok to switchover_to_me instead.


Read my first post again. Switchover is not possible if the other 
postgres instance is not available. The only way to make a single slave 
the new master is to use the failover command.


What *would* help here is:

1. monitor on A fails -> OCF_NOT_RUNNING
Now, instead of demoting A and promoting B:
2. Stop/Start the resource on A
Iirc start includes a monitor action (or probe, as it is sometimes called 
in this case). This would report OCF_RUNNING_MASTER, so the problem would 
be solved.


On the other hand, this is probably a pretty big change in Linux-HA's 
master/slave handling and this should be discussed.


Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

