[Linux-HA] Where did coros...@lists.osdl.org go?

2011-05-20 Thread Ulrich Windl
Hi!

The corosync-overview man page still has this address, but the address seems to
be gone:
lists.linux-foundation.org[140.211.169.51]
    said: 550 5.1.1 coros...@lists.osdl.org... User unknown

Does anybody know the current address? I hope the project is not dead...

Regards,
Ulrich


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Antw: Re: DO NOT start using heartbeat 2.x in crm mode, but just use Pacemaker, please! [was: managing resource httpd in heartbeat]

2011-05-20 Thread Ulrich Windl
 Lars Marowsky-Bree l...@suse.de wrote on 19.05.2011 at 13:02 in message
20110519110256.gl26...@suse.de:

[...]
 Of course. And while our esteemed SLES10 customers are still fully
 supported on our maintained 2.1.4-fixed version, I personally believe
 everyone should move swiftly to a newer code base (say, SLE HA 11 SP1).

No, we are waiting for SP2 ;-)

Ulrich




[Linux-HA] Antw: Re: SBD and SFEX on one shared (partitioned) disk?

2011-05-20 Thread Ulrich Windl
 Lars Marowsky-Bree l...@suse.de wrote on 19.05.2011 at 13:03 in message
20110519110338.gm26...@suse.de:
 On 2011-05-19T11:24:23, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de 
 wrote:
 
  Hi!
  
  From what I've read about SBD and SFEX, I could use one disk for both of
 them, if SBD and SFEX each get a partition on the disk. Right?
  Reason: The minimum size of a disk on our SAN is 1GB, and it's quite wasteful
 to have 1GB just for SBD. Doing some calculation, 1MB for SBD should be enough
 for almost any number of cluster nodes, and 900MB should be enough for more
 than 1000 resources to control.
 
 Well, yes. I'm not quite sure why you'd want to use sfex though if you
 have sbd fencing anyway.

SBD is for node fencing only. If I need to ensure exclusive assignment of
shared storage resources (well, you never know what the cluster software tries
to do) to avoid data corruption (e.g. through MD-RAID), I feel the need for
cluster-wide mutex locks.
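
Purely as an illustration of such a split (the device name, partition sizes and
tools shown are hypothetical examples, not from an actual setup):

# parted -s /dev/mapper/sbd_sfex mklabel gpt
# parted -s /dev/mapper/sbd_sfex mkpart sbd 1MiB 9MiB
# parted -s /dev/mapper/sbd_sfex mkpart sfex 9MiB 100%
# sbd -d /dev/mapper/sbd_sfex-part1 create    # initialize the SBD message slots
# sfex_init /dev/mapper/sbd_sfex-part2        # initialize the SFEX lock metadata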

Regards,
Ulrich




Re: [Linux-HA] Need HA Help - standby / online not switching automatically

2011-05-20 Thread Lars Ellenberg
On Thu, May 19, 2011 at 03:46:37PM -0700, Randy Katz wrote:
 To clarify, I was not seeking a quick response. I just noticed the 
 threads I searched
 were NEVER answered, with the problem that I reported. That being said 
 and about standby:
 
 Why does my node come up as standby and not as online?

Because you put it there.

The standby setting (like a few others) can take a lifetime, which usually
defaults to forever, though you can explicitly specify "reboot" (until reboot),
which actually means until the cluster system on that node is restarted.
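
For illustration only (the node name is a placeholder), with the crm shell that
looks roughly like:

# crm node standby node1          # lifetime defaults to forever
# crm node standby node1 reboot   # only until the cluster on node1 restarts
# crm node online node1           # bring it back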

 Is there a setting in my conf file that affects that?
 Or another issue, is it configuration, please advise.
 
 Thanks,
 Randy
 
 PS - Here are some threads where it seems they were never answered, one
 going back 3 years:
 
 http://www.mail-archive.com/linux-ha@lists.linux-ha.org/msg09886.html
 http://www.mail-archive.com/pacemaker@oss.clusterlabs.org/msg07663.html
 http://lists.community.tummy.com/pipermail/linux-ha/2008-August/034310.html

Then they probably have been solved off list, via IRC or support,
or by the original user finally having a facepalm experience.

Besides, yes, it happens that threads go unanswered, most of the time
because the question was badly asked ("does not work. why?"), and those
who could figure it out were distracted by more important things,
or decided that, at the time, trying to figure it out was too
time-consuming.

That's life.

If it happens to you, do a friendly bump,
and/or try to ask a smarter version of the question ;-)

Most of the time, the answer is in the logs, and the config.

But please break down the issue to a minimal configuration,
and post that minimal config plus logs of one incident.
Don't post your 2 MB XML config, plus a 2 GB log,
and expect people to dig through that for fun.

BTW, none of the quoted threads has anything to do with your experience,
AFAICS.

 On 5/19/2011 3:16 AM, Lars Ellenberg wrote:
  On Wed, May 18, 2011 at 09:55:00AM -0700, Randy Katz wrote:
  ps - I searched a lot online and I see this issue coming up,
  I doubt that _this_ issue comes up that often ;-)
 
  and then after about 3-4 emails they request the resources and
  constraints and then there is never an answer to the thread, why?!
  Hey, it's not even a day since you provided the config.
  People have day jobs.
  People get _paid_ to do support on these kinds of things,
  so they probably first deal with requests by paying customers.
 
  If you need SLAs, you may need to check out a support contract.
 
  Otherwise you need to be patient.
 
 
   From what I read, you probably just have misunderstood some concepts.
 
  Standby is not what I think you think it is ;-)
 
  Standby is NOT for deciding where resources will be placed.
 
  Standby is for manually switching a node into a mode where it WILL NOT
  run any resources. And it WILL NOT leave that state by itself.
  It is not supposed to.
 
  You switch a node into standby if you want to do maintenance on that
  node, do major software, system or hardware upgrades, or otherwise
  expect that it won't be useful to run resources there.
 
  It won't even run DRBD secondaries.
  It will run nothing there.
 
 
  If you want automatic failover, DO NOT put your nodes in standby.
  Because, if you do, they can not take over resources.
 
  You have to have your nodes online for any kind of failover to happen.
 
  If you want to have a preferred location for your resources,
  use location constraints.
 
 
  Does that help?
 
 
 
 

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.


[Linux-HA] Antw: Re: Setting up SBD resources in SLES11

2011-05-20 Thread Ulrich Windl
 Lars Marowsky-Bree l...@suse.de wrote on 19.05.2011 at 13:15 in message
20110519111526.gn26...@suse.de:
 On 2011-05-19T09:19:42, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de 
 wrote:
 
  Hi!
  
  I doubted that setting up the SBD resources is described correctly in
 the High Availability Guide for SLES 11. My comment (to Novell, I think) was:
 Shouldn't there be a resource per node? Following the procedure, the resource
 just starts on an arbitrary node. With one primitive per node, you'd need a
 location constraint to avoid multiple primitives running on the same node,
 right?
 
 No, one external/sbd resource per device (which usually means: per
 cluster) is sufficient.
 
 And you do not need to clone it.

From the description:
sbd uses a shared storage device as a medium to communicate
fencing requests. This allows clusters without network power
switches; the downside is that access to the shared storage
device becomes a Single Point of Failure.

So the sbd resource distributes the fencing requests. Now what if the node
where the sbd resource runs is in the minority (non-quorate) partition? How can
the rest of the cluster tell the minority to fence itself (in case of a network
failure)? AFAIK, as long as the storage is reachable, the sbd daemons will just
be happy.

Maybe it's confusing that an sbd daemon runs on every node, while the sbd
resource runs on only one node. A few more words of documentation might help here.
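
For what it's worth, the per-node slots (and what the daemons last wrote to
them) can be inspected directly on the device, e.g. using the /dev/SBD alias
from the guide excerpt below:

# sbd -d /dev/SBD dump    # header: timeouts and number of slots
# sbd -d /dev/SBD list    # one slot per node, with the last message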

Regards,
Ulrich


 
  Another book uses a clone resource for SBD (which seems to make sense).
 
 No, it doesn't. ;-) What value does that provide?
 
  For all who don't have the text at hand, here's what the guide writes
 about SBD setup (page 194):
  
  ---snip
  Configuring the Fencing Resource
  1 To complete the SBD setup, it is necessary to activate SBD as a 
 STONITH/fencing
  mechanism in the CIB as follows:
  crm configure
  crm(live)configure# property stonith-enabled=true
  crm(live)configure# property stonith-timeout=30s
  crm(live)configure# primitive stonith_sbd stonith:external/sbd params
  sbd_device=/dev/SBD
  crm(live)configure# commit
  crm(live)configure# quit
 
 Yes, and that's enough. The documentation is correct on this.
 
 
 Regards,
 Lars



 



Re: [Linux-HA] Need HA Help - standby / online not switching automatically

2011-05-20 Thread Randy Katz
Lars,

Thank you much for the answer on the standby issue.
It seems that that was the tip of my real issue. So now I have both nodes
coming online. And it seems ha1 starts fine with all the resources starting.

With them both online, if I issue: crm node standby ha1.iohost.com

Then I see IP Takeover on ha2 but the other resources do not start, 
ever, it remains:

Node ha1.iohost.com (b159178d-c19b-4473-aa8e-13e487b65e33): standby
Online: [ ha2.iohost.com ]

  Resource Group: WebServices
  ip1(ocf::heartbeat:IPaddr2):   Started ha2.iohost.com
  ip1arp (ocf::heartbeat:SendArp):   Started ha2.iohost.com
  fs_webfs   (ocf::heartbeat:Filesystem):Stopped
  fs_mysql   (ocf::heartbeat:Filesystem):Stopped
  apache2(lsb:httpd):Stopped
  mysql  (ocf::heartbeat:mysql): Stopped
  Master/Slave Set: ms_drbd_mysql
  Slaves: [ ha2.iohost.com ]
  Stopped: [ drbd_mysql:0 ]
  Master/Slave Set: ms_drbd_webfs
  Slaves: [ ha2.iohost.com ]
  Stopped: [ drbd_webfs:0 ]

Looking at the recent log I see this: May 20 12:46:42 ha2.iohost.com
pengine: [3117]: info: native_color: Resource fs_webfs cannot run anywhere

I am not sure why it cannot promote the other resources on ha2; I checked DRBD
before putting ha1 on standby and it was up to date. Here are the surrounding
log entries; the only thing I changed in the config is standby=off on both nodes:

May 20 12:47:06 ha2.iohost.com pengine: [3117]: notice: group_print:  Resource Group: WebServices
May 20 12:47:06 ha2.iohost.com pengine: [3117]: notice: native_print:  ip1  (ocf::heartbeat:IPaddr2):   Started ha2.iohost.com
May 20 12:47:06 ha2.iohost.com pengine: [3117]: notice: native_print:  ip1arp   (ocf::heartbeat:SendArp):   Started ha2.iohost.com
May 20 12:47:06 ha2.iohost.com pengine: [3117]: notice: native_print:  fs_webfs (ocf::heartbeat:Filesystem):   Stopped
May 20 12:47:06 ha2.iohost.com pengine: [3117]: notice: native_print:  fs_mysql (ocf::heartbeat:Filesystem):   Stopped
May 20 12:47:06 ha2.iohost.com pengine: [3117]: notice: native_print:  apache2  (lsb:httpd):   Stopped
May 20 12:47:06 ha2.iohost.com pengine: [3117]: notice: native_print:  mysql    (ocf::heartbeat:mysql):   Stopped
May 20 12:47:06 ha2.iohost.com pengine: [3117]: notice: clone_print:  Master/Slave Set: ms_drbd_mysql
May 20 12:47:06 ha2.iohost.com pengine: [3117]: notice: short_print:      Slaves: [ ha2.iohost.com ]
May 20 12:47:06 ha2.iohost.com pengine: [3117]: notice: short_print:      Stopped: [ drbd_mysql:0 ]
May 20 12:47:06 ha2.iohost.com pengine: [3117]: notice: clone_print:  Master/Slave Set: ms_drbd_webfs
May 20 12:47:06 ha2.iohost.com pengine: [3117]: notice: short_print:      Slaves: [ ha1.iohost.com ha2.iohost.com ]
May 20 12:47:06 ha2.iohost.com pengine: [3117]: info: rsc_merge_weights: ip1arp: Breaking dependency loop at ip1
May 20 12:47:06 ha2.iohost.com pengine: [3117]: info: rsc_merge_weights: ip1: Breaking dependency loop at ip1arp
May 20 12:47:06 ha2.iohost.com pengine: [3117]: info: native_color: Resource drbd_webfs:0 cannot run anywhere
May 20 12:47:06 ha2.iohost.com pengine: [3117]: info: master_color: ms_drbd_webfs: Promoted 0 instances of a possible 1 to master
May 20 12:47:06 ha2.iohost.com pengine: [3117]: info: rsc_merge_weights: fs_webfs: Rolling back scores from fs_mysql
May 20 12:47:06 ha2.iohost.com pengine: [3117]: info: native_color: Resource fs_webfs cannot run anywhere
May 20 12:47:06 ha2.iohost.com pengine: [3117]: info: native_color: Resource drbd_mysql:0 cannot run anywhere
May 20 12:47:06 ha2.iohost.com pengine: [3117]: info: master_color: ms_drbd_mysql: Promoted 0 instances of a possible 1 to master
May 20 12:47:06 ha2.iohost.com pengine: [3117]: info: master_color: ms_drbd_mysql: Promoted 0 instances of a possible 1 to master
May 20 12:47:06 ha2.iohost.com pengine: [3117]: info: rsc_merge_weights: fs_mysql: Rolling back scores from apache2
May 20 12:47:06 ha2.iohost.com pengine: [3117]: info: native_color: Resource fs_mysql cannot run anywhere
May 20 12:47:06 ha2.iohost.com pengine: [3117]: info: master_color: ms_drbd_mysql: Promoted 0 instances of a possible 1 to master
May 20 12:47:06 ha2.iohost.com pengine: [3117]: info: master_color: ms_drbd_webfs: Promoted 0 instances of a possible 1 to master
May 20 12:47:06 ha2.iohost.com pengine: [3117]: info: rsc_merge_weights: apache2: Rolling back scores from mysql
May 20 12:47:06 ha2.iohost.com pengine: [3117]: info: native_color: Resource apache2 cannot run anywhere
May 20 12:47:06 ha2.iohost.com pengine: [3117]: info: native_color: Resource mysql cannot run anywhere
May 20 12:47:06 ha2.iohost.com pengine: [3117]: info: master_color: ms_drbd_mysql: Promoted 0 instances of a possible 1 to master
May 20 12:47:06 ha2.iohost.com pengine: [3117]: info: master_color: ms_drbd_webfs: Promoted 0 instances of a possible 1 to master

Regards,
Randy

On 

Re: [Linux-HA] [Pacemaker] Announce: Hawk (HA Web Konsole) 0.4.1

2011-05-20 Thread Tim Serong
On 19/05/11 00:43, Tim Serong wrote:
 Hi Everybody,

 This is to announce version 0.4.1 of Hawk, a web-based GUI for managing
 and monitoring Pacemaker High-Availability clusters.

 [...]

 Building an RPM for Fedora/Red Hat is still just as easy as last time:

 # hg clone http://hg.clusterlabs.org/pacemaker/hawk
 # cd hawk
 # hg update hawk-0.4.1
 # make rpm

*ahem*

It /would/ still be just as easy if I had said "hg update tip", or, in
this specific instance, "hg update 398ae27386e" (the Makefile grabs the
last tag from hg to use as a version number, which is one commit *after*
the actual tagged commit).
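
A quick way to see where the tag actually sits relative to tip (revision range
and template here are just an example):

# hg log -r hawk-0.4.1:tip --template '{node|short} {tags}\n'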

Regards,

Tim
-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.


Re: [Linux-HA] Where did coros...@lists.osdl.org go?

2011-05-20 Thread Tim Serong
On 20/05/11 16:09, Ulrich Windl wrote:
 Hi!

  The corosync-overview man page still has this address, but the address seems
  to be gone:
  lists.linux-foundation.org[140.211.169.51]
   said: 550 5.1.1 coros...@lists.osdl.org... User unknown

  Does anybody know the current address? I hope the project is not dead...

Sounds like a bug in the manpage.  That should be:

   open...@lists.osdl.org

(See http://corosync.org/doku.php?id=support)

Regards,

Tim
-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.


Re: [Linux-HA] Need HA Help - standby / online not switching automatically

2011-05-20 Thread Lars Ellenberg
On Thu, May 19, 2011 at 11:53:24PM -0700, Randy Katz wrote:
 Lars,
 
 Thank you much for the answer on the standby issue.
 It seems that that was the tip of my real issue. So now I have both nodes
 coming online. And it seems ha1 starts fine with all the resources starting.
 
 With them both online, if I issue: crm node standby ha1.iohost.com

Why.

Learn about crm resource move.
(and unmove, for that matter).
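
For example (using the group and node names from your output; the location
score below is just an illustration):

# crm resource move WebServices ha2.iohost.com    # adds a location constraint preferring ha2
# crm resource unmove WebServices                 # removes that constraint again

and, for a permanent preference rather than a manual move:

# crm configure location loc_web_prefers_ha1 WebServices 100: ha1.iohost.com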

 Then I see IP Takeover on ha2 but the other resources do not start, 
 ever, it remains:
 
 Node ha1.iohost.com (b159178d-c19b-4473-aa8e-13e487b65e33): standby
 Online: [ ha2.iohost.com ]
 
   Resource Group: WebServices
   ip1(ocf::heartbeat:IPaddr2):   Started ha2.iohost.com
   ip1arp (ocf::heartbeat:SendArp):   Started ha2.iohost.com
   fs_webfs   (ocf::heartbeat:Filesystem):Stopped
   fs_mysql   (ocf::heartbeat:Filesystem):Stopped
   apache2(lsb:httpd):Stopped
   mysql  (ocf::heartbeat:mysql): Stopped
   Master/Slave Set: ms_drbd_mysql
   Slaves: [ ha2.iohost.com ]
   Stopped: [ drbd_mysql:0 ]
   Master/Slave Set: ms_drbd_webfs
   Slaves: [ ha2.iohost.com ]
   Stopped: [ drbd_webfs:0 ]
 
 In looking in the recent log I see this: May 20 12:46:42 ha2.iohost.com 
 pengine: [3117]: info: native_color: Resource fs_webfs cannot run anywhere
 
 I am not sure why it cannot promote the other resources on ha2, I 
 checked drbd before putting ha1 on standby and it was up to date.

Double check the status of drbd:
# cat /proc/drbd

Check what the cluster would do, and why:
# ptest -LVVV -s
[add more Vs to see more detail, but brace yourself for maximum confusion ;-)]

Check for constraints that get in the way:
# crm configure show | grep -Ee 'location|order'

check the master scores in the cib:
# cibadmin -Ql -o status | grep master

Look at the actions that have been performed on the resource,
on both nodes:
# grep lrmd:.*drbd_mysql /var/log/ha.log
              ^^-- drbd_mysql is the ID of your primitive
(or wherever that log ends up on your box)

 Here are the surrounding log entries, the only thing I changed in the
 config is standby=off on both nodes:
 
 May 20 12:47:06 ha2.iohost.com pengine: [3117]: notice: group_print:  
 Resource Group: WebServices
 May 20 12:47:06 ha2.iohost.com pengine: [3117]: notice: native_print:  
 ip1  (ocf::heartbeat:IPaddr2):   Started ha2.iohost.com
 May 20 12:47:06 ha2.iohost.com pengine: [3117]: notice: native_print:  
 ip1arp   (ocf::heartbeat:SendArp):   Started ha2.iohost.com
 May 20 12:47:06 ha2.iohost.com pengine: [3117]: notice: native_print:  
 fs_webfs (ocf::heartbeat:Filesystem):Stopped
 May 20 12:47:06 ha2.iohost.com pengine: [3117]: notice: native_print:  
 fs_mysql (ocf::heartbeat:Filesystem):Stopped
 May 20 12:47:06 ha2.iohost.com pengine: [3117]: notice: native_print:  
 apache2  (lsb:httpd):Stopped
 May 20 12:47:06 ha2.iohost.com pengine: [3117]: notice: native_print:  
 mysql(ocf::heartbeat:mysql): Stopped
 May 20 12:47:06 ha2.iohost.com pengine: [3117]: notice: clone_print:  
 Master/Slave Set: ms_drbd_mysql
 May 20 12:47:06 ha2.iohost.com pengine: [3117]: notice: short_print:  
 Slaves: [ ha2.iohost.com ]
 May 20 12:47:06 ha2.iohost.com pengine: [3117]: notice: short_print:  
 Stopped: [ drbd_mysql:0 ]
 May 20 12:47:06 ha2.iohost.com pengine: [3117]: notice: clone_print:  
 Master/Slave Set: ms_drbd_webfs
 May 20 12:47:06 ha2.iohost.com pengine: [3117]: notice: short_print:  
 Slaves: [ ha1.iohost.com ha2.iohost.com ]
 May 20 12:47:06 ha2.iohost.com pengine: [3117]: info: rsc_merge_weights: 
 ip1arp: Breaking dependency loop at ip1
 May 20 12:47:06 ha2.iohost.com pengine: [3117]: info: rsc_merge_weights: ip1: 
 Breaking dependency loop at ip1arp

You got a dependency loop?
Maybe you should fix that?

You put some things in a group in a specific order, then you specify the
reverse order in an explicit order and colocation constraint. That is not
particularly useful. Either use a group, or use explicit order/colocation
constraints, don't try to use both for the same resources.
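
Roughly, with the names from your config (purely as an illustration), pick one
of the two styles for the same resources:

# crm configure group WebServices ip1 ip1arp fs_webfs fs_mysql apache2 mysql

or, for a single pair, explicit constraints:

# crm configure order o_fs_before_apache inf: fs_webfs apache2
# crm configure colocation c_apache_with_fs inf: apache2 fs_webfs

but not both group membership and explicit order/colocation for the same pair.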

But that's nothing that would affect DRBD at this point.
And as long as your DRBD is not promoted (or cannot be?),
nothing that depends on it will run, obviously.

 May 20 12:47:06 ha2.iohost.com pengine: [3117]: info: native_color: Resource 
 drbd_webfs:0 cannot run anywhere
 May 20 12:47:06 ha2.iohost.com pengine: [3117]: info: master_color: 
 ms_drbd_webfs: Promoted 0 instances of a possible 1 to master
 May 20 12:47:06 ha2.iohost.com pengine: [3117]: info: rsc_merge_weights: 
 fs_webfs: Rolling back scores from fs_mysql
 May 20 12:47:06 ha2.iohost.com pengine: [3117]: info: native_color: Resource 
 fs_webfs cannot run anywhere
 May 20 12:47:06 ha2.iohost.com pengine: [3117]: info: native_color: Resource 
 drbd_mysql:0 cannot run anywhere
 May 20 12:47:06 ha2.iohost.com pengine: [3117]: info: master_color: 
 ms_drbd_mysql: 

[Linux-HA] SLES11 SP1: bug in crm shell completion

2011-05-20 Thread Ulrich Windl
Hi!

The crm shell of SLES11 SP1 has the following auto-completion bug: after
defining new primitives in "crm configure", the new primitives don't show up in
completion after "commit" until the configure context is re-entered (e.g. by
"up", "configure").

While talking about completion: if I enter "del foo" and move the cursor back
behind the 'l' of "del", crm doesn't complete the command (to "delete") as long
as there's another argument to the right of the cursor. In Bash, similar
completion works. Can it be implemented in the crm shell as well?

Regards,
Ulrich




Re: [Linux-HA] SLES11 SP1: bug in crm shell completion

2011-05-20 Thread Lars Ellenberg
On Fri, May 20, 2011 at 01:32:11PM +0200, Ulrich Windl wrote:
 Hi!
 
 The crm shell of SLES11 SP1 has the following auto-completion bug:
 After defining new primitives in crm configure, the new primitives
 don't show in completion after commit until the configure context is
 re-entered (e.g. by up, configure).
 
 While talking about completion: if I enter "del foo" and move the
 cursor back behind the 'l' of "del", crm doesn't complete the command
 (to "delete") as long as there's another argument to the right of the cursor.
 In Bash, similar completion works. Can it be implemented in the crm shell
 as well?

Sure.
Patches accepted ;-)

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.


Re: [Linux-HA] Need HA Help - standby / online not switching automatically

2011-05-20 Thread Randy Katz
Hi Lars,

Thank you for the tools to look at things. However, on a whim, before getting
into them (as DRBD was looking fine in that scenario), I decided to just run
through the install on a different pair of VMs, making sure I used the gitco.de
repository for drbd83 and the clusterlabs repo for pacemaker (heartbeat and
everything else comes with it once the libesmtp requirement is settled, in this
case by using a later EPEL install: rpm -ivH epel-release-5-4.noarch.rpm).
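
Roughly, the sequence was (package names as provided by those repositories;
exact names and versions may differ on other setups):

# rpm -ivH epel-release-5-4.noarch.rpm      # EPEL, for the libesmtp requirement
# yum install drbd83 kmod-drbd83            # from the gitco.de repository
# yum install pacemaker heartbeat           # from the clusterlabs repository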

Using the exact same configuration in crm (except that standby is off on both
VMs, of course), when I do the same crm node standby on one node, the other
takes over, and then back again, no problem. I am going to go back and either
reinstall the other pair and/or compare each and every rpm and source to see
which is broken, or just store my install procedure.

Now off to learn what you mentioned about crm resource move, thanks again.

Regards,
Randy


On 5/20/2011 1:03 AM, Lars Ellenberg wrote:
 On Thu, May 19, 2011 at 11:53:24PM -0700, Randy Katz wrote:
 Lars,

 Thank you much for the answer on the standby issue.
 It seems that that was the tip of my real issue. So now I have both nodes
 coming online. And it seems ha1 starts fine with all the resources starting.

 With them both online, if I issue: crm node standby ha1.iohost.com
 Why.

 Learn about crm resource move.
 (and unmove, for that matter).

 Then I see IP Takeover on ha2 but the other resources do not start,
 ever, it remains:

 Node ha1.iohost.com (b159178d-c19b-4473-aa8e-13e487b65e33): standby
 Online: [ ha2.iohost.com ]

Resource Group: WebServices
ip1(ocf::heartbeat:IPaddr2):   Started ha2.iohost.com
ip1arp (ocf::heartbeat:SendArp):   Started ha2.iohost.com
fs_webfs   (ocf::heartbeat:Filesystem):Stopped
fs_mysql   (ocf::heartbeat:Filesystem):Stopped
apache2(lsb:httpd):Stopped
mysql  (ocf::heartbeat:mysql): Stopped
Master/Slave Set: ms_drbd_mysql
Slaves: [ ha2.iohost.com ]
Stopped: [ drbd_mysql:0 ]
Master/Slave Set: ms_drbd_webfs
Slaves: [ ha2.iohost.com ]
Stopped: [ drbd_webfs:0 ]

 In looking in the recent log I see this: May 20 12:46:42 ha2.iohost.com
 pengine: [3117]: info: native_color: Resource fs_webfs cannot run anywhere

 I am not sure why it cannot promote the other resources on ha2, I
 checked drbd before putting ha1 on standby and it was up to date.
 Double check the status of drbd:
 # cat /proc/drbd

 Check what the cluster would do, and why:
 # ptest -LVVV -s
 [add more Vs to see more detail, but brace yourself for maximum confusion ;-)]

 Check for constraints that get in the way:
 # crm configure show | grep -Ee 'location|order'

 check the master scores in the cib:
 # cibadmin -Ql -o status | grep master

 Look at the actions that have been performed on the resource,
 on both nodes:
 vv-- the ID of your primitive
 # grep lrmd:.*drbd_mysql /var/log/ha.log
 or wherever that ends up on your box

 Here are the surrounding log entries, the only thing I changed in the
 config is standby=off on both nodes:

 May 20 12:47:06 ha2.iohost.com pengine: [3117]: notice: group_print:  
 Resource Group: WebServices
 May 20 12:47:06 ha2.iohost.com pengine: [3117]: notice: native_print:  
 ip1  (ocf::heartbeat:IPaddr2):   Started ha2.iohost.com
 May 20 12:47:06 ha2.iohost.com pengine: [3117]: notice: native_print:  
 ip1arp   (ocf::heartbeat:SendArp):   Started ha2.iohost.com
 May 20 12:47:06 ha2.iohost.com pengine: [3117]: notice: native_print:  
 fs_webfs (ocf::heartbeat:Filesystem):Stopped
 May 20 12:47:06 ha2.iohost.com pengine: [3117]: notice: native_print:  
 fs_mysql (ocf::heartbeat:Filesystem):Stopped
 May 20 12:47:06 ha2.iohost.com pengine: [3117]: notice: native_print:  
 apache2  (lsb:httpd):Stopped
 May 20 12:47:06 ha2.iohost.com pengine: [3117]: notice: native_print:  
 mysql(ocf::heartbeat:mysql): Stopped
 May 20 12:47:06 ha2.iohost.com pengine: [3117]: notice: clone_print:  
 Master/Slave Set: ms_drbd_mysql
 May 20 12:47:06 ha2.iohost.com pengine: [3117]: notice: short_print:  
 Slaves: [ ha2.iohost.com ]
 May 20 12:47:06 ha2.iohost.com pengine: [3117]: notice: short_print:  
 Stopped: [ drbd_mysql:0 ]
 May 20 12:47:06 ha2.iohost.com pengine: [3117]: notice: clone_print:  
 Master/Slave Set: ms_drbd_webfs
 May 20 12:47:06 ha2.iohost.com pengine: [3117]: notice: short_print:  
 Slaves: [ ha1.iohost.com ha2.iohost.com ]
 May 20 12:47:06 ha2.iohost.com pengine: [3117]: info: rsc_merge_weights: 
 ip1arp: Breaking dependency loop at ip1
 May 20 12:47:06 ha2.iohost.com pengine: [3117]: info: rsc_merge_weights: 
 ip1: Breaking dependency loop at ip1arp
 You got a dependency loop?
 Maybe you should fix that?

 You put some things in a group in a specific order, then you specify the
 reverse order in an explicit order and colocation