Re: [Pacemaker] Unable to execute crm(heartbeat/pacemaker) commands

2011-08-25 Thread Dejan Muhamedagic
Hi,

On Tue, Aug 23, 2011 at 11:15:09AM +0530, rakesh k wrote:
> Hi
> 
> I am using Heartbeat(3.0.3) and pacemaker (1.0.9).
> 
> We are facing the following issue. Please find the details below.
> 
> We have installed Heartbeat and Pacemaker on a Unix box (CentOS operating
> system).
> 
> We created an SSH user and provided it to one of the developers.
> Please find the directory structure and the bash profile for that SSH user.
> 
> bash-3.2# cat .bash_profile
> # .bash_profile
> # User specific environment and startup programs
> PATH=$PATH:/usr/sbin
> export PATH
> bash-3.2#
> But when one of the developers logs in through SSH to the box where
> Heartbeat/Pacemaker is installed, he is unable to execute crm
> configuration commands.
> For example, when we execute the following crm configuration command,
> it does not complete and the system hangs.

What is hanging? The crm shell? Does it react to ctrl-C? Can you
provide more details?

> Please find the crm configuration command we are using and the snapshot of
> the bash prompt while executing
> 
> -bash-3.2$ crm configure primitive HttpdVIP ocf:heartbeat:IPaddr3 \
> params ip="10.104.231.78" eth_num="eth0:2"
> vip_cleanup_file="/var/run/bigha.pid" \
> op start interval="0" timeout="120s" \
> op stop interval="0" timeout="120s" \
> > params ip="10.104.231.78" eth_num="eth0:2"
> vip_cleanup_file="/var/run/bigha.pid" \
> op monitor interval="30s" > op start interval="0"
> timeout="120s" \
> > op stop interval="0" timeout="120s" \
> > op monitor interval="30s"

Do you actually type all this on the command line? Why would you
want to do that? Why not use a file? There's no telling if and
how shell expansion would affect this.
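
A rough sketch of the file-based approach, under some assumptions: the
primitive definition below is a cleaned-up reading of the quoted transcript
(IPaddr3 and its parameters are taken from it as-is), the temporary file is
only for illustration, and `crm configure load update` is the crm shell
subcommand that loads a configuration file.

```shell
# Write the resource definition to a file. The quoted heredoc ('EOF')
# keeps the backslashes and double quotes literal, so no shell expansion
# touches the definition.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
primitive HttpdVIP ocf:heartbeat:IPaddr3 \
    params ip="10.104.231.78" eth_num="eth0:2" vip_cleanup_file="/var/run/bigha.pid" \
    op start interval="0" timeout="120s" \
    op stop interval="0" timeout="120s" \
    op monitor interval="30s"
EOF

# Load the file into the cluster configuration if the crm shell is installed.
if command -v crm >/dev/null 2>&1; then
    crm configure load update "$cfg"
else
    echo "crm not found; configuration written to $cfg"
fi
```

Note that the last line of the definition has no trailing backslash.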

Thanks,

Dejan

> Can you please help me with this particular scenario?



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Pacemaker does not support the '(null)' cluster infrastructure.

2011-08-25 Thread ihjaz Mohamed
I am facing the same issue with pacemaker-1.1.5 as well. Does this mean that
newer versions of Pacemaker no longer support Heartbeat?

--- On Thu, 25/8/11, Andreas Kurz  wrote:

From: Andreas Kurz 
Subject: Re: [Pacemaker] Pacemaker does not support the '(null)' cluster 
infrastructure.
To: pacemaker@oss.clusterlabs.org
Date: Thursday, 25 August, 2011, 3:20 AM

On 08/24/2011 05:35 PM, ihjaz Mohamed wrote:
> Hi All,
> 
> Am using heartbeat-3.0.4 with pacemaker-1.1.2-7 on RHEL 6.

That Pacemaker version only supports corosync/cman as ccm ... and
consider an update to the latest 1.1.5.

Regards,
Andreas

> 
> When I start the heartbeat service am getting the following in the log:
> 
> ---
> Aug 24 20:47:36 aceblr075.com crmd: [17206]: info: do_cib_control: Could
> not connect to the CIB service: connection failed
> 
> Aug 24 20:47:36 aceblr075.com crmd: [17206]: WARN: do_cib_control:
> Couldn't complete CIB registration 30 times... pause and retry
> 
> Aug 24 20:47:36 aceblr075.com crmd: [17206]: ERROR: do_cib_control:
> Could not complete CIB registration  30 times... hard error
> 
> Aug 24 20:47:36 aceblr075.com crmd: [17206]: ERROR: do_log: FSA: Input
> I_ERROR from do_cib_control() received in state S_STARTING
> 
> Aug 24 20:47:36 aceblr075.com crmd: [17206]: info: do_state_transition:
> State transition S_STARTING -> S_RECOVERY [ input=I_ERROR
> cause=C_FSA_INTERNAL origin=do_cib_control ]
> 
> Aug 24 20:47:36 aceblr075.com crmd: [17206]: ERROR: do_recover: Action
> A_RECOVER (0100) not supported
> 
> Aug 24 20:47:36 aceblr075.com crmd: [17206]: CRIT: get_cluster_type:
> This installation of Pacemaker does not support the '(null)' cluster
> infrastructure.  Terminating.
> 
> Aug 24 20:47:36 aceblr075.com heartbeat: [17188]: WARN: Managed
> /usr/lib64/heartbeat/crmd process 17206 exited with return code 100.
> ---
> 
> 
> Below is my ha.cf:
>
> debugfile /var/log/ha-debug
> logfile /var/log/ha-log
> logfacility local0
> keepalive 2
> deadtime 10
> warntime 5
> initdead 20
> udpport 694
> ucast eth1 10.10.10.2
> auto_failback off
>
> node aceblr075.com
> node aceblr076.com
>
> # Enable Pacemaker
> crm respawn
>
> 
> Am I doing something wrong here?
> 
> 
> 


Re: [Pacemaker] Pacemaker does not support the '(null)' cluster infrastructure.

2011-08-25 Thread Andreas Kurz
On 2011-08-25 10:41, ihjaz Mohamed wrote:
> I am facing the same issue with pacemaker-1.1.5 as well. Does this mean
> that newer versions of Pacemaker no longer support Heartbeat?

No, it means Red Hat compiles its version without Heartbeat support.

Regards,
Andreas



Re: [Pacemaker] Unable to execute crm(heartbeat/pacemaker) commands

2011-08-25 Thread Lars Ellenberg
On Thu, Aug 25, 2011 at 10:34:07AM +0200, Dejan Muhamedagic wrote:
> What is hanging? The crm shell? Does it react to ctrl-C? Can you
> provide more details.

My guess is that the shell prompt is "hanging".
Why?

Because you end the last part of the input with a backslash,
which of course causes the shell to wait for yet another line.

And if you don't type that line (or an additional return)
that shell prompt will wait for a very long time.
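
To illustrate with a harmless command (a sketch only, nothing crm-specific):
a trailing backslash makes the shell join the next line onto the current
one, so the command runs only once a line without a backslash has been
entered.

```shell
# Each trailing backslash continues the command on the next line;
# the last line has none, so the command is complete and runs.
joined=$(printf '%s ' one \
    two \
    three)
echo "$joined"
```

Ending that final line with a backslash as well would leave an interactive
shell sitting at its `>` continuation prompt until another line is entered.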

If that guess should turn out to be true, I suggest you
sleep more, drink more water or tea or coffee or whatever helps,
or first learn about the shell and do some *nix systems 101 in general
before trying to do cluster stuff.


-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.



Re: [Pacemaker] Unable to execute crm(heartbeat/pacemaker) commands

2011-08-25 Thread Lars Ellenberg
On Thu, Aug 25, 2011 at 11:05:32AM +0200, Lars Ellenberg wrote:
> My guess is that the shell prompt is "hanging".
> Why?
> 
> Because you end the last part of the input with a backslash,
> which of course causes the shell to wait for yet another line.
> 
> And if you don't type that line (or an additional return)
> that shell prompt will wait for a very long time.
> 
> If that guess should turn out to be true, I suggest you
> sleep more, drink more water or tea or coffee or whatever helps,
> 
> Or first learn about shell and do some *nix systems 101 in general
> before trying to do cluster stuff.

Then again, if it is something completely different,
I apologize for being impertinent ...

Lars



[Pacemaker] property default-resource-stickiness vs. rsc_defaults resource-stickiness

2011-08-25 Thread Brian J. Murrell
I've seen both a default-resource-stickiness property (e.g.
http://www.howtoforge.com/installation-and-setup-guide-for-drbd-openais-pacemaker-xen-on-opensuse-11.1)
and an rsc_defaults option with resource-stickiness
(http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ch05s03s02.html)
offered as solutions for preventing auto-failback.

I wonder what the difference between the two is and which one is
considered better/more correct.

Cheers,
b.





Re: [Pacemaker] How to prevent a node that joins the cluster after reboot from starting the resources.

2011-08-25 Thread mark - pacemaker list
Hello,

On Mon, Aug 22, 2011 at 2:55 AM, ihjaz Mohamed wrote:

> Hi,
>
> Has anyone here come across this issue?
>
>

Sorry for the delay, but I wanted to respond and let you know that I'm also
having this issue.  I can pretty reliably kill a pretty simple cluster setup
by rebooting one of the nodes.  When the rebooted node comes back up and
starts pacemaker, it instantly tries to start all services on itself,
ignoring that they're running happily and healthily on the other node and
resource stickiness is configured at 1000.  The result is none of the
resources running anywhere... they become unmanaged and crm status shows
that it thinks they are running on the freshly rebooted node.  If pacemaker
can be configured for a delay on startup before it tries to run services, I
think even 5 seconds of time would be enough for it to realize that it
should definitely not start anything at all.  I haven't been able to find a
setting that accomplishes that, though.

The cluster is a pretty simple one, trying to test the VirtualDomain RA
which in and of itself has given me fits (empty state files... why do they
get emptied rather than removed, which prevents the VM from starting until
you manually re-populate the state file no matter how many 'resource
cleanup' attempts you make?), but that is for another troubleshooting
session.  This problem is my biggie, because a healthy surviving node has
all resources forced off of it and killed by a rebooted one.

Has anybody else been running into this, or are we just two unlucky fellas?

This is currently on CentOS 6.0 with all updates (had the same issue on
Scientific Linux 6.1 so rolled back onto CentOS for consistency since all
other machines here are on it).  Both 'cman' and 'pacemaker' configured to
start at boot.  I'll throw cluster.conf and 'crm configure show' output on
the end of this in case it'll help someone spot a glaring mistake on my part
(which I'd love it to be at this point, as that is easily fixed).

Regards,
Mark

node kvm1
node kvm2
primitive apache1 ocf:heartbeat:VirtualDomain \
    params config="/etc/libvirt/qemu/apache1.xml" \
    meta allow-migrate="true" is-managed="true" \
    op start interval="0" timeout="90" \
    op stop interval="0" timeout="90" \
    op migrate_to interval="0" timeout="120" \
    op migrate_from interval="0" timeout="60"
primitive fw1 ocf:heartbeat:VirtualDomain \
    params config="/etc/libvirt/qemu/fw1.xml" \
    meta allow-migrate="true" is-managed="true" \
    op start interval="0" timeout="90" \
    op stop interval="0" timeout="90" \
    op migrate_to interval="0" timeout="120" \
    op migrate_from interval="0" timeout="60"
primitive vgClusterDisk ocf:heartbeat:LVM \
    params volgrpname="vgClusterDisk" \
    op start interval="0" timeout="30" \
    op stop interval="0" timeout="120"
clone shared_volgrp vgClusterDisk \
    meta target-role="Started" is-managed="true"
order storage_then_VMs inf: shared_volgrp ( fw1 apache1 )
property $id="cib-bootstrap-options" \
    dc-version="1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe" \
    cluster-infrastructure="cman" \
    no-quorum-policy="ignore" \
    stonith-action="reboot" \
    stonith-timeout="30s" \
    maintenance-mode="false" \
    pe-error-series-max="5000" \
    pe-warn-series-max="5000" \
    pe-input-series-max="5000" \
    dc-deadtime="2min" \
    stonith-enabled="false" \
    last-lrm-refresh="1314294568"
rsc_defaults $id="rsc-options" \
    resource-stickiness="1000"




> --- On *Wed, 17/8/11, ihjaz Mohamed * wrote:
>
>
> From: ihjaz Mohamed 
> Subject: [Pacemaker] How to prevent a node that joins the cluster after
> reboot from starting the resources.
> To: pacemaker@oss.clusterlabs.org
> Date: Wednesday, 17 August, 2011, 12:23 PM
>
> Hi All,
>
> I am getting an unmanaged error as shown below when one of the nodes is
> rebooted and comes back to join the cluster.
>
> Online: [ aceblr101.com aceblr107.com ]
>
>  Resource Group: HAService
>  FloatingIP (ocf::heartbeat:IPaddr2): Started aceblr107.com (unmanaged) FAILED
>  acestatus (lsb:acestatus): Stopped
>  Clone Set: pingdclone
>  Started: [ aceblr101.com aceblr107.com ]
>
> Failed actions:
> FloatingIP_stop_0 (node=aceblr107.com, call=7, rc=1, status=complete): unknown error
>
> Below is my configuration:
>
> node $id="8bf8e613-f63c-43a6-8915-4b2dbf72a4a5" aceblr101.com
> node $id="bde62a1f-0f29-4357-a988-0e26bb06c4fb" aceblr107.com
> primitive FloatingIP ocf:heartbeat:IPaddr2 \
> params ip="xx.xxx.xxx.xxx" nic="eth0:0"
> primitive acestatus lsb:acestatus \
> op start interval="30"
> primitive pingd ocf:pacemaker:pingd \
> params host_list="xx.xxx.xxx.1" multiplier="100" \
> op monitor interval="15s" timeout="5s"
> group HAService FloatingIP acestatus \
> meta target-role="Started"
> clone pingdclone pingd \
> meta globally-unique="false"
> location ip1_location FloatingIP \
> rule $id="ip1_location-rule"

Re: [Pacemaker] How to prevent a node that joins the cluster after reboot from starting the resources.

2011-08-25 Thread mark - pacemaker list
Hello again,

Replying to my own message with a "for the archives" post, my issue with
services being started concurrently after a node reboot came down to the
fact that I'm using the VirtualDomain RA, but by default CentOS 6.0 and
Scientific Linux 6.1 (and presumably RHEL6 as well) start libvirtd as one of
the very last services, after pacemaker has already been fired up.  The
VirtualDomain RA does some initial monitoring checks when pacemaker starts,
but gets a "connection refused" error since libvirtd isn't running yet.
This appears to be what causes the empty state files in
/var/run/heartbeat/rsctmp/VirtualDomain-.state and also triggers the
starting of the guest VMs even though they're actually running on the other
node.

So, you go from a good state:
--

Last updated: Thu Aug 25 14:35:49 2011
Stack: cman
Current DC: kvm2 - partition WITHOUT quorum
Version: 1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe
2 Nodes configured, unknown expected votes
3 Resources configured.


Online: [ kvm2 ]
OFFLINE: [ kvm1 ]

fw1 (ocf::heartbeat:VirtualDomain): Started kvm2
 Clone Set: shared_volgrp
 Started: [ kvm2 ]
 Stopped: [ vgClusterDisk:0 ]
apache1 (ocf::heartbeat:VirtualDomain): Started kvm2
--



To a not-good state for about a minute:
--

Last updated: Thu Aug 25 14:38:06 2011
Stack: cman
Current DC: kvm2 - partition with quorum
Version: 1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe
2 Nodes configured, unknown expected votes
3 Resources configured.


Online: [ kvm1 kvm2 ]

fw1 (ocf::heartbeat:VirtualDomain) Started [ kvm1 kvm2 ]
 Clone Set: shared_volgrp
 Started: [ kvm1 kvm2 ]
apache1 (ocf::heartbeat:VirtualDomain) Started [ kvm1 kvm2 ]

Failed actions:
fw1_monitor_0 (node=kvm1, call=2, rc=1, status=complete): unknown error
fw1_stop_0 (node=kvm1, call=5, rc=1, status=complete): unknown error
apache1_monitor_0 (node=kvm1, call=4, rc=1, status=complete): unknown error
apache1_stop_0 (node=kvm1, call=6, rc=1, status=complete): unknown error
--


And then it finally settles on this state, where it thinks the VMs are
running on the freshly booted node and unmanaged, but they're actually dead
everywhere:
--

Last updated: Thu Aug 25 14:38:13 2011
Stack: cman
Current DC: kvm2 - partition with quorum
Version: 1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe
2 Nodes configured, unknown expected votes
3 Resources configured.


Online: [ kvm1 kvm2 ]

fw1 (ocf::heartbeat:VirtualDomain): Started kvm1 (unmanaged) FAILED
 Clone Set: shared_volgrp
 Started: [ kvm1 kvm2 ]
apache1 (ocf::heartbeat:VirtualDomain): Started kvm1 (unmanaged) FAILED

Failed actions:
fw1_monitor_0 (node=kvm1, call=2, rc=1, status=complete): unknown error
fw1_stop_0 (node=kvm1, call=5, rc=1, status=complete): unknown error
apache1_monitor_0 (node=kvm1, call=4, rc=1, status=complete): unknown error
apache1_stop_0 (node=kvm1, call=6, rc=1, status=complete): unknown error
--


At this point, starting a VM is impossible regardless of any attempts you
make with 'resource cleanup' or 'resource manage'.  You have to manually
echo the name of the domain into its state file, then do a cleanup, and
everything will start.
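
A minimal sketch of that manual repair, with loudly stated assumptions: the
state file is taken to be named after the resource ID (the path quoted above
has that part missing), the resource `fw1` is taken to manage a libvirt
domain also called `fw1` (a guess from its config path), and `restore_state`
is a hypothetical helper, not part of any package. It is demonstrated on a
scratch directory rather than the live one.

```shell
# Hypothetical helper: write the libvirt domain name into a VirtualDomain
# state file if that file is missing or empty.
#   $1 = state directory, $2 = resource ID, $3 = libvirt domain name
restore_state() {
    state_file="$1/VirtualDomain-$2.state"
    [ -s "$state_file" ] || printf '%s\n' "$3" > "$state_file"
}

# Demonstration on a scratch directory with a simulated emptied state file.
scratch=$(mktemp -d)
: > "$scratch/VirtualDomain-fw1.state"
restore_state "$scratch" fw1 fw1
cat "$scratch/VirtualDomain-fw1.state"

# On the affected node (not run here; names are assumptions as noted above):
#   restore_state /var/run/heartbeat/rsctmp fw1 fw1
#   crm resource cleanup fw1
```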

So, I disabled pacemaker from starting via the normal route and just added a
line to /etc/rc.local that starts it, since that's the absolute last thing
done at boot.  I didn't want to mess with the chkconfig settings in the init
script and get bitten by this down the line somewhere after an update that
replaced the init script.  Now libvirtd is there for pacemaker and things
behave as expected, at least after three reboot tests in a row.
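
Roughly, that change could look like the sketch below. The exact commands
are assumptions (`chkconfig pacemaker off` and a `service pacemaker start`
line are the usual way to do this on a CentOS 6 style system, but they are
not quoted from my shell history), and `ensure_line` is a hypothetical
helper shown against a scratch file rather than the real /etc/rc.local.

```shell
# Hypothetical helper: append a line to a file only if it is not already
# present, so re-running it does not add duplicates.
#   $1 = file, $2 = line
ensure_line() {
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Demonstration on a scratch file standing in for /etc/rc.local.
rc_local=$(mktemp)
ensure_line "$rc_local" "service pacemaker start"
ensure_line "$rc_local" "service pacemaker start"   # second call is a no-op
cat "$rc_local"

# On the node itself (not run here):
#   chkconfig pacemaker off
#   ensure_line /etc/rc.local "service pacemaker start"
```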

I suppose it may also work to make libvirtd a pacemaker resource, with an
order constraint so it's started before any VMs are ever probed/started.
That'd take away easy/painless restarts of libvirtd, though.  I'll have to
do some further digging to see what makes the most sense.

Anyhow, sorry for the noise on the list, but I always hate it when someone
posts a problem, then either disappears forever or replies back to the list
with, "Nevermind, fixed it!" and no explanation.

Regards,
Mark

--- 8< -- snipped everything else, this is too long as it is ---

Re: [Pacemaker] How to prevent a node that joins the cluster after reboot from starting the resources.

2011-08-25 Thread Vladislav Bogdanov
25.08.2011 23:22, mark - pacemaker list wrote:
[snip]
> So, I disabled pacemaker from starting via the normal route and just
> added a line to /etc/rc.local that starts it, since that's the absolute
> last thing done at boot.  I didn't want to mess with the chkconfig
> settings in the init script and get bitten by this down the line
> somewhere after an update that replaced the init script.  Now libvirtd
> is there for pacemaker and things behave as expected, at least after
> three reboot tests in a row.

Pacemaker now (1.1) starts at level 99. That was discussed on the ML.

> 
> I suppose it may also work to make libvirtd a pacemaker resource, with
> an order constraint so it's started before any VMs are ever
> probed/started.  That'd take away easy/painless restarts of libvirtd,
> though.  I'll have to do some further digging to see what makes the most
> sense.

Please do not forget that libvirt may start guests by itself, even if
libvirtd is started by pacemaker. This is true for at least the latest
libvirt releases: it starts all guests marked for autostart AND all
guests that were running when libvirtd was last active (e.g. just before
the node was fenced).
The only solution I found to prevent this is to delete all
/etc/libvirt/qemu and /var/.../libvirt files early during the boot sequence.

Best,
Vladislav
