Re: [Linux-HA] Beginner questions

2009-03-23 Thread Dominik Klein
Juha Heinanen wrote:
> Juha Heinanen writes:
> 
>  > the real problem is that pacemaker stops starting the mysql server
>  > altogether after a few manual stops (/etc/init.d/mysql stop).
> 
> i think i figured this out.  when pacemaker needed to start my
> mysql-server resource three times on node lenny1, it migrated the group
> to node lenny2.  when i then repeated stopping of mysql-server on lenny2,
> it migrated the group back to lenny1, but didn't start mysql-server,
> because it remembered that it had already started it there 3 times.
> 
> if so, my conclusion is to forget migration-threshold parameter.

That sounds about right.

You can configure a failure-timeout. That's an amount of time after
which the cluster forgets about failures.

Read up on failure timeout and don't miss the section "how to ensure
time based rules take effect" in the pdf documentation.
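
For example, something along these lines (a rough sketch only, the values are
placeholders -- reusing the group from your config):

group mysql-server-group fs0 virtual-ip mysql-server \
        meta migration-threshold="3" failure-timeout="120s"
property cluster-recheck-interval="2min"

With a failure-timeout, fail counts older than that value are expired, and
cluster-recheck-interval makes the policy engine re-evaluate the cluster often
enough for the expiry to actually be noticed.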

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Heartbeat degrades drbd resource

2009-03-23 Thread Adam Gandelman


I don't know what your CIB xml config looks like but, if you're forcing 
collocation within a group, you may want to try using the drbddisk RA 
provided by heartbeat instead of the drbd CRM OCF script.
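
Something along these lines, for example (untested sketch; the resource name is
made up, and heartbeat-class agents take positional parameters, so the drbd
resource goes in as parameter "1" -- adjust to the CIB syntax of your version):

primitive resource_drbddisk heartbeat:drbddisk \
        params 1="dr0"

and then keep it grouped/colocated with the Filesystem resource as before.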



Heiko Schellhorn wrote:

> Hi
> 
> I installed drbd (8.0.14) together with heartbeat (2.0.8) on a Gentoo system.
> 
> I have the following problem:
> Standalone the drbd resource works perfectly. I can mount/unmount it alternately 
> on both nodes. Reading/writing works and /proc/drbd looks fine.
> 
> But when I start heartbeat it degrades the resource step by step until it's 
> marked as unconfigured. An excerpt of the logfile is attached.
> Heartbeat itself starts up and runs. Two of the three resources configured up 
> to now are also working. Only drbd shows problems. (See the file 
> crm_mon-out)
> 
> I don't think it's a problem of communication between the nodes because drbd 
> is working standalone and e.g. the IPaddr2 resource is also working within 
> heartbeat.
> I also tried several heartbeat-configurations. First I defined the resources 
> as single resources and then I combined the resources to a resource group.
> There was no difference.
> 
> Has someone seen such an issue before? Any ideas? 
> I didn't find anything helpful in the list archive.
> 
> If you need more information I can provide a complete log and the config.
> 
> Thanks
> 
> Heiko

  






Last updated: Mon Mar 23 13:11:49 2009
Current DC: mainsrv2 (d7bd5c11-babc-4b69-97d6-3d20d01d8d66)
2 Nodes configured.
1 Resources configured.


Node: mainsrv2 (d7bd5c11-babc-4b69-97d6-3d20d01d8d66): online
Node: mainsrv1 (6a5eacba-7389-4305-9074-de6116504c49): online

Resource Group: heartbeat_group_1
resource_IP (heartbeat::ocf:IPaddr2):   Started mainsrv1
resource_drbd   (heartbeat::ocf:drbd):  Started mainsrv1
fs_drbd	(heartbeat::ocf:Filesystem):	Stopped 
  



drbd[30750][30763]: 2009/03/23_12:42:19 DEBUG: dr0: Calling /sbin/drbdadm -c 
/etc/drbd.conf state dr0
drbd[30750][30772]: 2009/03/23_12:42:19 DEBUG: dr0: Exit code 0
drbd[30750][30778]: 2009/03/23_12:42:19 DEBUG: dr0: Command output: 
Secondary/Secondary
drbd[30750][30794]: 2009/03/23_12:42:19 DEBUG: dr0: Calling /sbin/drbdadm -c 
/etc/drbd.conf cstate dr0
drbd[30750][30798]: 2009/03/23_12:42:19 DEBUG: dr0: Exit code 0
drbd[30750][30799]: 2009/03/23_12:42:19 DEBUG: dr0: Command output: Connected
drbd[30750][30800]: 2009/03/23_12:42:19 DEBUG: dr0 status: Secondary/Secondary 
Secondary Secondary Connected
drbd[30808][30815]: 2009/03/23_12:42:20 DEBUG: dr0: Calling /sbin/drbdadm -c 
/etc/drbd.conf state dr0
drbd[30808][30819]: 2009/03/23_12:42:20 DEBUG: dr0: Exit code 0
drbd[30808][30820]: 2009/03/23_12:42:20 DEBUG: dr0: Command output: 
Secondary/Unknown
drbd[30808][30830]: 2009/03/23_12:42:20 DEBUG: dr0: Calling /sbin/drbdadm -c 
/etc/drbd.conf cstate dr0
drbd[30808][30836]: 2009/03/23_12:42:20 DEBUG: dr0: Exit code 0
drbd[30808][30837]: 2009/03/23_12:42:20 DEBUG: dr0: Command output: WFConnection
drbd[30808][30839]: 2009/03/23_12:42:20 DEBUG: dr0 status: Secondary/Unknown 
Secondary Unknown WFConnection
drbd[30808][30841]: 2009/03/23_12:42:20 DEBUG: dr0: Calling /sbin/drbdadm -c 
/etc/drbd.conf down dr0
drbd[30808][30873]: 2009/03/23_12:42:21 DEBUG: dr0: Exit code 0
drbd[30808][30874]: 2009/03/23_12:42:21 DEBUG: dr0: Command output:
drbd[30808][30875]: 2009/03/23_12:42:21 DEBUG: dr0 stop: drbdadm down succeeded.
drbd[30876][30883]: 2009/03/23_12:42:21 DEBUG: dr0: Calling /sbin/drbdadm -c 
/etc/drbd.conf state dr0
drbd[30876][30888]: 2009/03/23_12:42:21 DEBUG: dr0: Exit code 0
drbd[30876][30889]: 2009/03/23_12:42:21 DEBUG: dr0: Command output: Unconfigured
drbd[30876][30897]: 2009/03/23_12:42:21 DEBUG: dr0: Calling /sbin/drbdadm -c 
/etc/drbd.conf cstate dr0
drbd[30876][30901]: 2009/03/23_12:42:21 DEBUG: dr0: Exit code 0
drbd[30876][30902]: 2009/03/23_12:42:21 DEBUG: dr0: Command output: Unconfigured
drbd[30876][30903]: 2009/03/23_12:42:21 DEBUG: dr0 status: Unconfigured 
Unconfigured Unconfigured Unconfigured
drbd[30876][30904]: 2009/03/23_12:42:21 DEBUG: dr0 start: already configured.

  



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems




Re: [Linux-HA] STONITH: internal vs. external

2009-03-23 Thread Dejan Muhamedagic
Hi,

On Mon, Mar 23, 2009 at 09:41:38AM -0700, Ethan Bannister wrote:
> 
> This may turn out to be a silly question, but what is the true difference
> between an external STONITH plugin and an internal STONITH plugin?

"internal" (not a good name) plugins are locked in memory on
start in order to have them function even in situations when
memory is tight. So, they are a bit better than external plugins.
But nowadays, with enough memory, there's not much difference
between the two.

> I have
> a SAN set up for fail-over, and everything looks like it is working as it
> should.  However, I would like to use STONITH to prevent split-brain.  I do
> not have a STONITH device so I would need to use something like meatware
> (which does not allow automatic fail-over) or ssh.  But there are two types,
> ssh and external/ssh.  What is the difference?  I will try to do some
> research in the meantime, but if I get an answer before I find out myself,
> that would be greatly appreciated.  Also, if I were to use ssh as a STONITH
> plugin, will my machine automatically migrate resources to the other
> machine?

You should never use ssh for production clusters. It is not
reliable. It is good for testing only. If you have a SAN, you can
try external/sbd.
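
Roughly like this (a sketch, not tested here; the device path is a placeholder
and the parameter name can be checked with "stonith -t external/sbd -n"):

# once, from one node: initialize a small shared partition for sbd
sbd -d /dev/<shared-disk-partition> create

# then configure the stonith resource, e.g. in crm shell syntax
primitive stonith-sbd stonith:external/sbd \
        params sbd_device="/dev/<shared-disk-partition>"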

Thanks,

Dejan

> Thanks for any help you can provide :)
> -- 
> View this message in context: 
> http://www.nabble.com/STONITH%3A-internal-vs.-external-tp22663871p22663871.html
> Sent from the Linux-HA mailing list archive at Nabble.com.
> 
> ___
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] suicide in no-quorum-policy

2009-03-23 Thread Dejan Muhamedagic
Hi,

On Fri, Mar 20, 2009 at 05:09:00PM +0100, Michael Schwartzkopff wrote:
> Hi,
> 
> in the metadata of the pengine I found the option no-quorum-policy which can 
> be set to "suicide".
> 
> What exactly does the node do when this option is set to suicide?

Should commit suicide, i.e. reboot. But I never tried that.
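
For reference, it is just a cluster property, e.g. in crm shell syntax:

property no-quorum-policy="suicide"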

> Has STONITH to be configured to make this option work?

No.

> As far as I remember, there once was a discussion that no node can commit 
> suicide via STONITH. Is this still valid?

Yes, with the exception of the suicide plugin. But I'd recommend
using a "real" device for stonith.

> Does this option make sense if STONITH 
> is used? Or is some other mechanism used?

Normally, stonith should take care of that.

Thanks,

Dejan

> Thanks for the enlightening answers.
> 
> -- 
> Dr. Michael Schwartzkopff
> MultiNET Services GmbH
> Addresse: Bretonischer Ring 7; 85630 Grasbrunn; Germany
> Tel: +49 - 89 - 45 69 11 0
> Fax: +49 - 89 - 45 69 11 21
> mob: +49 - 174 - 343 28 75
> 
> mail: mi...@multinet.de
> web: www.multinet.de
> 
> Sitz der Gesellschaft: 85630 Grasbrunn
> Registergericht: Amtsgericht München HRB 114375
> Geschäftsführer: Günter Jurgeneit, Hubert Martens
> 
> ---
> 
> PGP Fingerprint: F919 3919 FF12 ED5A 2801 DEA6 AA77 57A4 EDD8 979B
> Skype: misch42
> 
> ___
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] STONITH: internal vs. external

2009-03-23 Thread Ethan Bannister

This may turn out to be a silly question, but what is the true difference
between an external STONITH plugin and an internal STONITH plugin?  I have
a SAN set up for fail-over, and everything looks like it is working as it
should.  However, I would like to use STONITH to prevent split-brain.  I do
not have a STONITH device so I would need to use something like meatware
(which does not allow automatic fail-over) or ssh.  But there are two types,
ssh and external/ssh.  What is the difference?  I will try to do some
research in the meantime, but if I get an answer before I find out myself,
that would be greatly appreciated.  Also, if I were to use ssh as a STONITH
plugin, will my machine automatically migrate resources to the other
machine?

Thanks for any help you can provide :)
-- 
View this message in context: 
http://www.nabble.com/STONITH%3A-internal-vs.-external-tp22663871p22663871.html
Sent from the Linux-HA mailing list archive at Nabble.com.

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Beginner questions

2009-03-23 Thread Juha Heinanen
Juha Heinanen writes:

 > the real problem is that pacemaker stops starting the mysql server
 > altogether after a few manual stops (/etc/init.d/mysql stop).

i think i figured this out.  when pacemaker needed to start my
mysql-server resource three times on node lenny1, it migrated the group
to node lenny2.  when i then repeated stopping of mysql-server on lenny2,
it migrated the group back to lenny1, but didn't start mysql-server,
because it remembered that it had already started it there 3 times.

if so, my conclusion is to forget migration-threshold parameter.

-- juha
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Beginner questions

2009-03-23 Thread Juha Heinanen
Dominik Klein writes:

 > Heartbeat will for example no longer be part of the next suse enterprise
 > linux (sles11) ha solution. It will be based on openais. So for new
 > setups, this should be the way to go - at least imho.

yes, once there are packages available for debian lenny.

-- juha
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Beginner questions

2009-03-23 Thread Juha Heinanen
Dominik Klein writes:

 > I read your email on the pacemaker list and from what you've shared and
 > explained, I cannot spot a configuration issue. It should just work
 > like that (and does work like that for me).

i did more experiments and noticed that migration-threshold=N doesn't
work as i thought it would.  i thought that if starting of a resource
fails N times, the group containing the resource would migrate to the
other node.

what happens instead is that if N is 3, for example, and i stop the
resource (e.g., mysql server) three times, pacemaker will start it two
times on the original node and on the third start migrates the resources
to the other node, even though the start worked fine each time.

is there a way to achieve the migration only when the start has failed N
times?

 > Maybe post your entire configuration, preferably a hb_report
 > archive.

i think i had a bug in my crm during the earlier tests.  i had set
migration-threshold on an individual resource (mysql-server)

crm_resource --meta --resource mysql-server --set-parameter migration-threshold 
--property-value 3

instead of the whole group.  now i have

group mysql-server-group fs0 virtual-ip mysql-server \
meta migration-threshold="3"

and migration of the resources takes place after the third start.  complete
config is below.
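
to watch the fail counts while testing, something like this helps (the exact
option letters may differ between versions, see the man pages):

crm_mon -f                                  # show per-resource fail counts
crm_resource -C -r mysql-server -H lenny1   # clean up / reset them on lenny1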

the real problem is that pacemaker stops starting the mysql server
altogether after a few manual stops (/etc/init.d/mysql stop).

here is an example.  i stop mysql, and all the other resources are started
on the other node except the mysql server:

crmd[9940]: 2009/03/23_19:33:23 info: send_direct_ack: ACK'ing resource op 
drbd0:0_monitor_6 from 5:8:0:84c3fc98-c640-4a3f-b0ea-c1f17e5f73bc: 
lrm_invoke-lrmd-1237829603-11
crmd[9940]: 2009/03/23_19:33:23 info: do_lrm_rsc_op: Performing 
key=59:8:0:84c3fc98-c640-4a3f-b0ea-c1f17e5f73bc op=drbd0:0_notify_0 )
lrmd[9937]: 2009/03/23_19:33:23 info: rsc:drbd0:0: notify
crmd[9940]: 2009/03/23_19:33:23 info: do_lrm_rsc_op: Performing 
key=61:8:0:84c3fc98-c640-4a3f-b0ea-c1f17e5f73bc op=drbd0:0_notify_0 )
crmd[9940]: 2009/03/23_19:33:24 info: process_lrm_event: LRM operation 
drbd0:0_monitor_6 (call=31, rc=-2, cib-update=0, confirmed=true) Cancelled 
unknown exec error
lrmd[9937]: 2009/03/23_19:33:24 info: rsc:drbd0:0: notify
crmd[9940]: 2009/03/23_19:33:24 info: process_lrm_event: LRM operation 
drbd0:0_notify_0 (call=32, rc=0, cib-update=49, confirmed=true) complete ok
crmd[9940]: 2009/03/23_19:33:24 info: process_lrm_event: LRM operation 
drbd0:0_notify_0 (call=33, rc=0, cib-update=50, confirmed=true) complete ok
crmd[9940]: 2009/03/23_19:33:26 info: do_lrm_rsc_op: Performing 
key=62:8:0:84c3fc98-c640-4a3f-b0ea-c1f17e5f73bc op=drbd0:0_notify_0 )
lrmd[9937]: 2009/03/23_19:33:26 info: rsc:drbd0:0: notify
crmd[9940]: 2009/03/23_19:33:26 info: do_lrm_rsc_op: Performing 
key=13:8:0:84c3fc98-c640-4a3f-b0ea-c1f17e5f73bc op=drbd0:0_promote_0 )
crm_master[13804]: 2009/03/23_19:33:26 info: Invoked: /usr/sbin/crm_master -l 
reboot -v 75 
lrmd[9937]: 2009/03/23_19:33:27 info: RA output: (drbd0:0:notify:stdout) 0 
Trying master-drbd0:0=75 update via attrd
lrmd[9937]: 2009/03/23_19:33:27 info: rsc:drbd0:0: promote
crmd[9940]: 2009/03/23_19:33:27 info: process_lrm_event: LRM operation 
drbd0:0_notify_0 (call=34, rc=0, cib-update=51, confirmed=true) complete ok
lrmd[9937]: 2009/03/23_19:33:27 info: RA output: (drbd0:0:promote:stdout) 

drbd[13811]:2009/03/23_19:33:27 INFO: drbd0 promote: primary succeeded
crmd[9940]: 2009/03/23_19:33:27 info: process_lrm_event: LRM operation 
drbd0:0_promote_0 (call=35, rc=0, cib-update=52, confirmed=true) complete ok
crmd[9940]: 2009/03/23_19:33:29 info: do_lrm_rsc_op: Performing 
key=60:8:0:84c3fc98-c640-4a3f-b0ea-c1f17e5f73bc op=drbd0:0_notify_0 )
lrmd[9937]: 2009/03/23_19:33:29 info: rsc:drbd0:0: notify
crm_master[13983]: 2009/03/23_19:33:29 info: Invoked: /usr/sbin/crm_master -l 
reboot -v 75 
lrmd[9937]: 2009/03/23_19:33:29 info: RA output: (drbd0:0:notify:stdout) 0 
Trying master-drbd0:0=75 update via attrd
crmd[9940]: 2009/03/23_19:33:29 info: process_lrm_event: LRM operation 
drbd0:0_notify_0 (call=36, rc=0, cib-update=53, confirmed=true) complete ok
crmd[9940]: 2009/03/23_19:33:31 info: do_lrm_rsc_op: Performing 
key=44:8:0:84c3fc98-c640-4a3f-b0ea-c1f17e5f73bc op=fs0_start_0 )
lrmd[9937]: 2009/03/23_19:33:31 info: rsc:fs0: start
crmd[9940]: 2009/03/23_19:33:31 info: do_lrm_rsc_op: Performing 
key=14:8:8:84c3fc98-c640-4a3f-b0ea-c1f17e5f73bc op=drbd0:0_monitor_59000 )
Filesystem[13990]:  2009/03/23_19:33:31 INFO: Running start for /dev/drbd0 
on /var/lib/mysql
crmd[9940]: 2009/03/23_19:33:31 info: process_lrm_event: LRM operation 
drbd0:0_monitor_59000 (call=38, rc=8, cib-update=54, confirmed=false) complete 
master
crmd[9940]: 2009/03/23_19:33:31 info: process_lrm_event: LRM operation 
fs0_start_0 (call=37, rc=0, cib-update=55, confirmed=true) complete ok
crmd[9940]: 2009/03/23_19:33:33 info: do_lrm_rsc_op: Performing 
key=46:8:0:84c3fc98-c640-4a3f-b0ea-c

Re: [Linux-HA] Beginner questions

2009-03-23 Thread Les Mikesell

Dominik Klein wrote:

>> Is there some documentation available for openais?  I can't even find a
>> good description of what it does or why you would use it.  Also, will
>> this help with my 2nd question: having a few spares for a large number
>> of servers?  While my objective with the squid cache is to proxy
>> everything through one server to maximize the cache hits, I may switch
>> to memcached on a group of machines and would like to have a standby or
>> 2 that could take over for any failing machine.
>
> Well, there are man-pages and the mailing list. The install page even
> has a configuration example. And I have found this thread to be
> especially helpful:
> https://lists.linux-foundation.org/pipermail/openais/2009-March/010894.html

Yes, but I want to know why I should use it before dealing with how to 
install and configure.  Is there a feature list, FAQ, or comparison to 
other mechanisms?

> openais will be the future platform for pacemaker clusters providing the
> communication infrastructure and node failure detection.
>
> Heartbeat will for example no longer be part of the next suse enterprise
> linux (sles11) ha solution. It will be based on openais. So for new
> setups, this should be the way to go - at least imho.

The code may be great, but it really needs a little public relations 
effort unless I'm missing something.  Is there any way to find the 
answer to my question above (many active hosts per spare)?


--
  Les Mikesell
   lesmikes...@gmail.com

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Beginner questions

2009-03-23 Thread Dominik Klein
> Is there some documentation available for openais?  I can't even find a
> good description of what it does or why you would use it.  Also, will
> this help with my 2nd question: having a few spares for a large number
> of servers?  While my objective with the squid cache is to proxy
> everything through one server to maximize the cache hits, I may switch
> to memcached on a group of machines and would like to have a standby or
> 2 that could take over for any failing machine.

Well, there are man-pages and the mailing list. The install page even
has a configuration example. And I have found this thread to be
especially helpful:
https://lists.linux-foundation.org/pipermail/openais/2009-March/010894.html

openais will be the future platform for pacemaker clusters providing the
communication infrastructure and node failure detection.

Heartbeat will for example no longer be part of the next suse enterprise
linux (sles11) ha solution. It will be based on openais. So for new
setups, this should be the way to go - at least imho.

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Beginner questions

2009-03-23 Thread Les Mikesell

Dominik Klein wrote:

>> My first HA setup is for a squid proxy where all I need is to move an IP
>> address to a backup server if the primary fails (and the cache can just
>> rebuild on its own).  This seems to work, but will only fail over if the
>> machine goes down completely or the primary IP is unreachable.  Is that
>> typical or are there monitors for the service itself so failover would
>> happen if the squid process is not running or stops accepting connections?
>>
>> Second question (unrelated):  Can heartbeat be set up so one or two
>> spare machines could automatically take over the IP address of any of a
>> much larger pool of machines that might fail?
>
> Heartbeat in v1 mode (haresources configuration) cannot do any resource
> level monitoring itself. You'd need to do that externally by any means.
>
> If you're just starting out learning now, I'd suggest going with openais
> and pacemaker instead of heartbeat right away. Check out the
> documentation on www.clusterlabs.org/wiki/install and
> www.clusterlabs.org/wiki/Documentation

Is there some documentation available for openais?  I can't even find a 
good description of what it does or why you would use it.  Also, will 
this help with my 2nd question: having a few spares for a large number 
of servers?  While my objective with the squid cache is to proxy 
everything through one server to maximize the cache hits, I may switch 
to memcached on a group of machines and would like to have a standby or 
2 that could take over for any failing machine.


--
  Les Mikesell
   lesmikes...@gmail.com

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Heartbeat degrades drbd resource

2009-03-23 Thread Dominik Klein
Dominik Klein wrote:
> You cannot use drbd in heartbeat the way you configured it.
> 
> Please refer to http://wiki.linux-ha.org/DRBD/HowTov2 

Sorry, copy/paste error. I meant to say

http://www.clusterlabs.org/wiki/DRBD_HowTo_1.0
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Heartbeat degrades drbd resource

2009-03-23 Thread Dominik Klein
You cannot use drbd in heartbeat the way you configured it.

Please refer to http://wiki.linux-ha.org/DRBD/HowTov2 and (if that
wasn't made clear enough on the page) make sure the first thing you do
is upgrade your cluster software. Read here on how to do that:
http://clusterlabs.org/wiki/Install
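
Once you are on a current stack, the drbd part of the configuration from that
howto looks roughly like this in crm shell syntax (a sketch only; resource
names, mount point and intervals are placeholders to adapt):

primitive drbd0 ocf:heartbeat:drbd \
        params drbd_resource="dr0" \
        op monitor interval="59s" role="Master" timeout="30s" \
        op monitor interval="60s" role="Slave" timeout="30s"
ms ms-drbd0 drbd0 \
        meta master-max="1" clone-max="2" notify="true"
primitive fs_drbd ocf:heartbeat:Filesystem \
        params device="/dev/drbd0" directory="/data" fstype="ext3"
colocation fs_on_drbd inf: fs_drbd ms-drbd0:Master
order fs_after_drbd inf: ms-drbd0:promote fs_drbd:start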

Regards
Dominik

Heiko Schellhorn wrote:
> Hi
> 
> I installed drbd (8.0.14) together with heartbeat (2.0.8) on a Gentoo system.
> 
> I have the following problem:
> Standalone the drbd resource works perfectly. I can mount/unmount it 
> alternately on both nodes. Reading/writing works and /proc/drbd looks fine.
> 
> But when I start heartbeat it degrades the resource step by step until it's 
> marked as unconfigured. An excerpt of the logfile is attached.
> Heartbeat itself starts up and runs. Two of the three resources configured up 
> to now are also working. Only drbd shows problems. (See the file  
> crm_mon-out)
> 
> I don't think it's a problem of communication between the nodes because drbd 
> is working standalone and e.g. the IPaddr2 resource is also working within 
> heartbeat.
> I also tried several heartbeat-configurations. First I defined the resources 
> as single resources and then I combined the resources to a resource group.
> There was no difference.
> 
> Has someone seen such an issue before? Any ideas?
> I didn't find anything helpful in the list archive.
> 
> If you need more information I can provide a complete log and the config.
> 
> Thanks
> 
> Heiko
> 
> 
> 
> 
> 
> ___
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Heartbeat degrades drbd resource

2009-03-23 Thread Heiko Schellhorn
Hi

I installed drbd (8.0.14) together with heartbeat (2.0.8) on a Gentoo system.

I have the following problem:
Standalone the drbd resource works perfectly. I can mount/unmount it alternately 
on both nodes. Reading/writing works and /proc/drbd looks fine.

But when I start heartbeat it degrades the resource step by step until it's 
marked as unconfigured. An excerpt of the logfile is attached.
Heartbeat itself starts up and runs. Two of the three resources configured up 
to now are also working. Only drbd shows problems. (See the file  
crm_mon-out)

I don't think it's a problem of communication between the nodes because drbd 
is working standalone and e.g. the IPaddr2 resource is also working within 
heartbeat.
I also tried several heartbeat-configurations. First I defined the resources 
as single resources and then I combined the resources to a resource group.
There was no difference.

Has someone seen such an issue before? Any ideas?
I didn't find anything helpful in the list archive.

If you need more information I can provide a complete log and the config.

Thanks

Heiko

-- 
---
Dipl. Inf. Heiko Schellhorn

University of Bremen            Room:  NW1-U 2065
Inst. of Environmental Physics  Phone: +49(0)421 218 4080
P.O. Box 33 04 40   Fax:   +49(0)421 218 4555
D-28334 Bremen  Mail:  mailto:sch...@physik.uni-bremen.de
Germany www:   http://www.iup.uni-bremen.de
   http://www.sciamachy.de
   http://www.geoscia.de



Last updated: Mon Mar 23 13:11:49 2009
Current DC: mainsrv2 (d7bd5c11-babc-4b69-97d6-3d20d01d8d66)
2 Nodes configured.
1 Resources configured.


Node: mainsrv2 (d7bd5c11-babc-4b69-97d6-3d20d01d8d66): online
Node: mainsrv1 (6a5eacba-7389-4305-9074-de6116504c49): online

Resource Group: heartbeat_group_1
resource_IP (heartbeat::ocf:IPaddr2):   Started mainsrv1
resource_drbd   (heartbeat::ocf:drbd):  Started mainsrv1
fs_drbd (heartbeat::ocf:Filesystem):Stopped 
drbd[30750][30763]: 2009/03/23_12:42:19 DEBUG: dr0: Calling /sbin/drbdadm -c 
/etc/drbd.conf state dr0
drbd[30750][30772]: 2009/03/23_12:42:19 DEBUG: dr0: Exit code 0
drbd[30750][30778]: 2009/03/23_12:42:19 DEBUG: dr0: Command output: 
Secondary/Secondary
drbd[30750][30794]: 2009/03/23_12:42:19 DEBUG: dr0: Calling /sbin/drbdadm -c 
/etc/drbd.conf cstate dr0
drbd[30750][30798]: 2009/03/23_12:42:19 DEBUG: dr0: Exit code 0
drbd[30750][30799]: 2009/03/23_12:42:19 DEBUG: dr0: Command output: Connected
drbd[30750][30800]: 2009/03/23_12:42:19 DEBUG: dr0 status: Secondary/Secondary 
Secondary Secondary Connected
drbd[30808][30815]: 2009/03/23_12:42:20 DEBUG: dr0: Calling /sbin/drbdadm -c 
/etc/drbd.conf state dr0
drbd[30808][30819]: 2009/03/23_12:42:20 DEBUG: dr0: Exit code 0
drbd[30808][30820]: 2009/03/23_12:42:20 DEBUG: dr0: Command output: 
Secondary/Unknown
drbd[30808][30830]: 2009/03/23_12:42:20 DEBUG: dr0: Calling /sbin/drbdadm -c 
/etc/drbd.conf cstate dr0
drbd[30808][30836]: 2009/03/23_12:42:20 DEBUG: dr0: Exit code 0
drbd[30808][30837]: 2009/03/23_12:42:20 DEBUG: dr0: Command output: WFConnection
drbd[30808][30839]: 2009/03/23_12:42:20 DEBUG: dr0 status: Secondary/Unknown 
Secondary Unknown WFConnection
drbd[30808][30841]: 2009/03/23_12:42:20 DEBUG: dr0: Calling /sbin/drbdadm -c 
/etc/drbd.conf down dr0
drbd[30808][30873]: 2009/03/23_12:42:21 DEBUG: dr0: Exit code 0
drbd[30808][30874]: 2009/03/23_12:42:21 DEBUG: dr0: Command output:
drbd[30808][30875]: 2009/03/23_12:42:21 DEBUG: dr0 stop: drbdadm down succeeded.
drbd[30876][30883]: 2009/03/23_12:42:21 DEBUG: dr0: Calling /sbin/drbdadm -c 
/etc/drbd.conf state dr0
drbd[30876][30888]: 2009/03/23_12:42:21 DEBUG: dr0: Exit code 0
drbd[30876][30889]: 2009/03/23_12:42:21 DEBUG: dr0: Command output: Unconfigured
drbd[30876][30897]: 2009/03/23_12:42:21 DEBUG: dr0: Calling /sbin/drbdadm -c 
/etc/drbd.conf cstate dr0
drbd[30876][30901]: 2009/03/23_12:42:21 DEBUG: dr0: Exit code 0
drbd[30876][30902]: 2009/03/23_12:42:21 DEBUG: dr0: Command output: Unconfigured
drbd[30876][30903]: 2009/03/23_12:42:21 DEBUG: dr0 status: Unconfigured 
Unconfigured Unconfigured Unconfigured
drbd[30876][30904]: 2009/03/23_12:42:21 DEBUG: dr0 start: already configured.

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] expected-quorum-votes

2009-03-23 Thread Dominik Klein
> crmd metadata tells me that expected-quorum-votes
> are used to calculate quorum in openais based clusters. Its default value is 
> 2. Do I have to change this value if I have 3 or more nodes in an OpenAIS 
> based cluster?

No. It is automatically adjusted by the cluster.

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] maintenance-mode of pengine

2009-03-23 Thread Dominik Klein
Michael Schwartzkopff wrote:
> Hi,
> 
> In the metadata of the pengine I found the attribute maintenance-mode. I did 
> not find any documentation about it. The long description also says: "Should 
> the cluster ...". Anybody knows what this options does?
> 
> Thanks.

It disables resource management when set to true. Like
"is-managed-default" did in the old days, plus, iirc, it also disables
all ops. But better let Andrew verify the latter.
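
For reference, it is set like any other cluster property, e.g. in crm shell
syntax

property maintenance-mode="true"

or with

crm_attribute -t crm_config -n maintenance-mode -v true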

Regards
Dominik


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Beginner questions

2009-03-23 Thread Dominik Klein
Juha Heinanen wrote:
> Dominik Klein writes:
> 
>  > Heartbeat in v1 mode (haresources configuration) cannot do any resource
>  > level monitoring itself. You'd need to do that externally by any
>  > means.
> 
> yes, in v2 mode i have managed to make pacemaker to monitor resources,
> for example, like this:
> 
> primitive test lsb:test \
>   op monitor interval="30s" timeout="5s" \
>   meta target-role="Started"
> 
> but i still have failed to find out how to make pacemaker migrate
> a resource group to another node if one of the resources in the group
> fails to start.
> 
> for example, if test is the last member of group
> 
> group test-group fs0 mysql-server virtual-ip test
> 
> and fails to start, the group is not migrated to another node.
> 
> i have tried to add 
> 
> primitive test lsb:test op monitor interval=30s timeout=5s meta 
> migration-threshold=3
> 
> but it just stopped monitoring of test after 3 attempts.
> 
> any ideas how to achieve migration?

I read your email on the pacemaker list and from what you've shared and
explained, I cannot spot a configuration issue. It should just work
like that (and does work like that for me).

Maybe post your entire configuration, preferably a hb_report archive.
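
Something like the following should do (check hb_report's help for the exact
options of your version; the time and destination are just examples):

hb_report -f "2009-03-23 19:30" /tmp/report-mysql

run on one node around the time of the failed start, then post the resulting
tarball.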

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Beginner questions

2009-03-23 Thread Juha Heinanen
Dominik Klein writes:

 > Heartbeat in v1 mode (haresources configuration) cannot do any resource
 > level monitoring itself. You'd need to do that externally by any
 > means.

yes, in v2 mode i have managed to make pacemaker to monitor resources,
for example, like this:

primitive test lsb:test \
op monitor interval="30s" timeout="5s" \
meta target-role="Started"

but i still have failed to find out how to make pacemaker migrate
a resource group to another node if one of the resources in the group
fails to start.

for example, if test is the last member of group

group test-group fs0 mysql-server virtual-ip test

and fails to start, the group is not migrated to another node.

i have tried to add 

primitive test lsb:test op monitor interval=30s timeout=5s meta 
migration-threshold=3

but it just stopped monitoring of test after 3 attempts.

any ideas how to achieve migration?

-- juha
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Beginner questions

2009-03-23 Thread Dominik Klein
Les Mikesell wrote:
> My first HA setup is for a squid proxy where all I need is to move an IP
> address to a backup server if the primary fails (and the cache can just
> rebuild on its own).  This seems to work, but will only fail over if the
> machine goes down completely or the primary IP is unreachable.  Is that
> typical or are there monitors for the service itself so failover would
> happen if the squid process is not running or stops accepting connections?
> 
> Second question (unrelated):  Can heartbeat be set up so one or two
> spare machines could automatically take over the IP address of any of a
> much larger pool of machines that might fail?
> 

Heartbeat in v1 mode (haresources configuration) cannot do any resource
level monitoring itself. You'd need to do that externally by any means.

If you're just starting out learning now, I'd suggest going with openais
and pacemaker instead of heartbeat right away. Check out the
documentation on www.clusterlabs.org/wiki/install and
www.clusterlabs.org/wiki/Documentation

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems