Re: [Linux-HA] samba lsb script

2008-08-01 Thread Andrew Beekhof
On Thu, Jul 31, 2008 at 19:51, Serge Dubrouski [EMAIL PROTECTED] wrote:

 One more thing to learn about Pacemaker :-) It looks like it runs
 monitor/status action for all configured resources before trying to
 start any of those resources. Then modifying that init script is your
 only option.

100% correct
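
Because every resource is probed with status before anything is started, the init script's status action has to answer cleanly even while the DRBD-backed filesystem is still unmounted. A minimal sketch of such a status helper, assuming the LSB exit codes (0 = running, 3 = not running). The function name and the pidfile argument are hypothetical and not taken from any real samba script, and the sketch folds the LSB "dead but pidfile exists" case (code 1) into 3 for brevity:

```shell
#!/bin/sh
# Hypothetical status helper for an LSB init script.
# Exit codes follow the LSB convention: 0 = running, 3 = not running.
lsb_status() {
    pidfile="$1"
    # No readable pidfile (e.g. the shared filesystem is not mounted yet):
    # report "not running" instead of an error.
    [ -r "$pidfile" ] || return 3
    pid=$(cat "$pidfile" 2>/dev/null)
    # kill -0 only tests whether the process exists; it sends no signal.
    [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null && return 0
    return 3
}
```

The point is that status never returns a generic failure just because its files live on storage that is not available yet.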
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] samba lsb script

2008-08-01 Thread Andrew Beekhof
On Thu, Jul 31, 2008 at 15:41, Thibaut Perrin [EMAIL PROTECTED] wrote:
 Why don't you put the samba and drbd resources in a resource group, as the
 samba will always be launched AFTER the drbd resource and filesystem ?

because we also check the status of resources _before_ we start
anything (doing so afterwards would defeat the point of doing so)


Re: [Linux-HA] resource keep restarting on standby node

2008-08-01 Thread Andrew Beekhof
On Fri, Aug 1, 2008 at 03:35, jijun gao [EMAIL PROTECTED] wrote:
 hi, Andreas
very short interval and timeout
 Jul 31 16:24:37 node2 last message repeated 9 times
Jul 31 16:24:37 node2 setroubleshoot:  SELinux is preventing ifconfig
(ifconfig_t) read write to socket:[136168] (initrc_t).  For complete
 SELinux messages. run sealert -l 0db84664-2bd3-4f8f-a10e-1e0641417484

hmmm ... I'm not familiar with SELinux, but that looks suspicious to
me. I assume on node1 SELinux is disabled?

 actually, on node1 SELinux is enabled, but I don't find similar log
 information on node1.
 anyway, the two nodes don't have completely the same software environment,
 and I disabled SELinux on node2.

 Jul 31 16:24:37 node2 lrmd: [29544]: WARN: asterisk_2:monitor process
 (PID 23374) timed out (try 1).  Killing with signal SIGTERM (15).

... and because of the monitoring timeout the resource is declared
dead and restarted.

 you got it. when I set timeout=10, resources don't restart as they used to.
 but I am still not quite sure what timeout means.

it means that the operation has 10s to complete before we assume it failed
and the operation is performed every {interval} seconds.
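
To make the two settings concrete, here is a rough sketch of what a recurring monitor amounts to, with coreutils `timeout` standing in for the cluster's own timer. This is not how lrmd is implemented; `run_monitor` and its arguments are made up for illustration:

```shell
#!/bin/sh
# Illustration only: run a check with a deadline, the way a monitor
# operation with timeout=N is treated. coreutils `timeout` sends SIGTERM
# when the deadline expires and then exits with status 124.
run_monitor() {
    limit="$1"; shift
    rc=0
    timeout "$limit" "$@" || rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "monitor timed out after ${limit}s: treated as failed"
    elif [ "$rc" -ne 0 ]; then
        echo "monitor failed with rc=$rc"
    else
        echo "monitor ok"
    fi
    return "$rc"
}

# A recurring monitor with interval=30 and timeout=10 is then roughly:
#   while :; do run_monitor 10 /etc/init.d/asterisk status; sleep 30; done
```

With a timeout shorter than the time the status check really needs, every run ends in the first branch and the resource gets restarted, which matches the log above.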

 here is my understanding:
 the monitor action is actually a process that runs again and again;
 the process takes some time to execute, and at every interval
 a new process is started. Is that true?

 still, there is something else I don't understand.
 why does the 'restarting' only happen on the standby node?
 (as far as I know, it has nothing to do with SELinux)

 Thanks a bunch


Re: [Linux-HA] colocation constraint dependencies

2008-08-01 Thread Andrew Beekhof
On Wed, Jul 30, 2008 at 19:18, daniel peess [EMAIL PROTECTED] wrote:
 hello andreas,

 On Wed, Jul 30, 2008 at 11:04:10AM +0200, Andreas Kurz wrote:
 Ok .. I see. Try to set the 'default-resource-stickiness' to a
 positive value and give each of your groups a different 'priority'.
 That should do the trick.

 setting the 'default-resource-stickiness' to a positive value now prevents
 the restart behavior when a node returns, thanks.
 but this is only half of a workaround for the problem below.

 this doesn't help if you crash/standby/stop a node.
 resources that were running on this node are pushing away
 other resources, although both of their scores are equal/unset.
 if other free nodes are available the failing resource should start there,
 and if none are available shouldn't start at all.

 instead heartbeat restarts resources depending on the colocation
 constraints, although those should just distribute the resources across all
 nodes.

 again, all groups shall be treated equal if they have the same score,
 the failure of one shall never affect other ones.
 IMHO this is a bug.


then please submit one


Re: [Linux-HA] Resources starting twice

2008-08-01 Thread Lars Marowsky-Bree
On 2008-07-31T16:44:31, Angel Rengifo Cancino [EMAIL PROTECTED] wrote:

 Yep, it's because I'm first trying to understand very well heartbeat
 1.x before learning 2.x style. Using haresources it seems easier for
 my simple requirements.

That's not necessarily helpful, as v2 is very different, and knowledge
from v1 almost does not apply at all.

The errors you have are very likely related to the LSB scripts not being
quite LSB compliant.

However, v1 and even v2 have some scenarios which might cause the
start or even stop action to be issued twice. The scripts must be
able to handle this, as they are defined to be idempotent.
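
A minimal sketch of what idempotent start and stop actions look like in an init script. The helpers `is_running`, `do_start` and `do_stop` are placeholders that only touch a state file, standing in for whatever the real script does:

```shell
#!/bin/sh
# Sketch of idempotent start/stop actions. The state file and the three
# helpers below are placeholders, not part of any real init script.
STATE_FILE="${STATE_FILE:-/tmp/demo-service.running}"

is_running() { [ -e "$STATE_FILE" ]; }
do_start()   { : > "$STATE_FILE"; }
do_stop()    { rm -f "$STATE_FILE"; }

start() {
    # Starting an already running service is a success, not an error.
    if is_running; then
        return 0
    fi
    do_start
}

stop() {
    # Stopping an already stopped service is a success as well.
    if ! is_running; then
        return 0
    fi
    do_stop
}
```

Calling start twice, or stop twice, then returns 0 both times, which is what the cluster expects.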


Regards,
Lars

-- 
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde



Re: [Linux-HA] Resources starting twice

2008-08-01 Thread Angel Rengifo Cancino
Thanks Lars and Michael:

The squid script from CentOS 5.2 wasn't working correctly when asked
to start twice. I edited /etc/init.d/squid and now a second start
always returns exit code 0.

Now heartbeat doesn't give up when it tries to start an already running
service. I'll check every init script before using it with heartbeat.

mmm, this is a different question: Do I really need to start/stop
services with haresources? Why can't I simply keep my services
always running (chkconfig services on)? Is it not enough to move the
IP alias between nodes?



Re: [Linux-HA] mgmtd not starting on opensuse 11 i386 (unresolved symbol)

2008-08-01 Thread Sebastian Reitenbach
Dejan Muhamedagic [EMAIL PROTECTED] wrote:
 Hi,

 On Wed, Jul 30, 2008 at 08:55:53AM +0200, Sebastian Reitenbach wrote:
  General Linux-HA mailing list linux-ha@lists.linux-ha.org wrote:
   On Mon, Jul 28, 2008 at 05:52:10PM -, root wrote:
Hi,
Dejan Muhamedagic [EMAIL PROTECTED] wrote:
 Hi,

 On Mon, Jul 28, 2008 at 04:41:27PM +0200, Sebastian Reitenbach wrote:
  Hi,

  I just upgraded my desktop to opensuse 11.0 i586, and updated the
  box, then installed the heartbeat rpm's 2.1.3 from
  download.opensuse.org.

  I've these rpm's installed right now:
  pacemaker-heartbeat-0.6.5-8.2
  heartbeat-common-2.1.3-23.1
  heartbeat-resources-2.1.3-23.1
  heartbeat-2.1.3-23.1
  pacemaker-pygui-1.4-1.3

  I've added these lines to /etc/ha.d/ha.cf to start mgmtd
  automatically:
  apiauth mgmtd   uid=root
  respawn root    /usr/lib/heartbeat/mgmtd -v

  but mgmtd fails to start, when I try to start it on the commandline,
  then I see the following output:

  /usr/lib/heartbeat/mgmtd: symbol lookup error: /usr/lib/libpe_status.so.2: undefined symbol: stdscr

  As far as I researched now, the stdscr symbol is expected to come
  from ncurses?

 Looks like a dependency problem. Does the package containing
 mgmtd depend on the ncurses library? Though I don't understand
 why mgmtd needs ncurses.
I found this out, in a thread in some m/l, regarding the error message
about the undefined symbol, but maybe this is just wrong.
   
   stdscr is an external variable defined in ncurses.h which is
   included from ./lib/crm/pengine/unpack.h which is part of the
   code that gets built in libpe_status. The pacemaker rpm, which
   includes that library, does depend on libncurses. Is that the
   case with the pacemaker you downloaded?
  I've these installed:
  rpm -qa | grep -i ncurs
  ncurses-utils-5.6-83.1
  libncurses5-5.6-83.1
  yast2-ncurses-pkg-2.16.14-0.1
  yast2-ncurses-2.16.27-8.1
  
  rpm -q --requires pacemaker-heartbeat
  /bin/sh
  /bin/sh
  /sbin/ldconfig
  /sbin/ldconfig
  rpmlib(PayloadFilesHavePrefix) = 4.0-1
  rpmlib(CompressedFileNames) = 3.0.4-1
  /bin/sh
  /usr/bin/python
  libbz2.so.1
  libc.so.6
  libc.so.6(GLIBC_2.0)
  libc.so.6(GLIBC_2.1)
  libc.so.6(GLIBC_2.1.3)
  libc.so.6(GLIBC_2.2)
  libc.so.6(GLIBC_2.3)
  libc.so.6(GLIBC_2.3.4)
  libc.so.6(GLIBC_2.4)
  libccmclient.so.1
  libcib.so.1
  libcrmcluster.so.1
  libcrmcommon.so.2
  libdl.so.2
  libgcrypt.so.11
  libglib-2.0.so.0
  libgnutls.so.26
  libgnutls.so.26(GNUTLS_1_4)
  libgpg-error.so.0
  libhbclient.so.1
  liblrm.so.0
  libltdl.so.3
  libm.so.6
  libncurses.so.5
  libpam.so.0
  libpam.so.0(LIBPAM_1.0)
  libpcre.so.0
  libpe_rules.so.2
  libpe_status.so.2
  libpengine.so.3
  libplumb.so.1
  librt.so.1
  libstonithd.so.0
  libtransitioner.so.1
  libxml2.so.2
  libz.so.1
  rpmlib(PayloadIsLzma) = 4.4.2-1
  
  
  rpm -ql libncurses5-5.6-83.1
  /lib/libncurses.so.5
  /lib/libncurses.so.5.6
  ...
  
  so it does require ncurses, but it is installed.
  
  
  but 
  nm /lib/libncurses.so.5.6
  nm: /lib/libncurses.so.5.6: no symbols
 
 That's fine, it means that the binary is stripped. If you take a
 look at libncurses.a (which is probably only in the development
 package), you should see some symbols. BTW, you can also try
 objdump with -T:
 
 $ objdump -T libncurses.so.5 | grep stdscr
 0015a630 gDO .bss   0008  Base stdscr


here I have:
objdump -T  /lib64/libncurses.so.5 | grep stdscr
002465e8 gDO .bss   0008  Base stdscr

Meanwhile I observed the problem on an opensuse 10.3 i386 and on an
opensuse 11 x86_64 box too.

Seems like there is a general problem with this version.
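
One way to narrow this down might be to check whether libpe_status itself was linked against ncurses, rather than only whether the rpm declares the dependency. `objdump -p` lists a library's NEEDED entries and `ldd -r` reports unresolved symbols. The helper name below is made up, and the paths are the ones reported in this thread:

```shell
#!/bin/sh
# Reads `objdump -p <library>` output on stdin and succeeds if the given
# library name appears among the NEEDED entries of the dynamic section.
has_needed() {
    grep -q "NEEDED.*$1"
}

# Usage on the affected box:
#   objdump -p /usr/lib/libpe_status.so.2 | has_needed libncurses \
#       || echo "libpe_status is not linked against ncurses"
#   ldd -r /usr/lib/libpe_status.so.2 2>&1 | grep stdscr
```

If libncurses is missing from the NEEDED list, the symbol can stay unresolved even though the library is installed, which would point at how the package was built.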

kind regards
Sebastian



Re: [Linux-HA] external/ipmi problems

2008-08-01 Thread Chun Tian (binghe)

Hi, Brock Palen

Good, finally someone has run into the same thing as me. I just don't
know whether there is any chance that stonith/external turns a return
value of 0 into 256, or whether ipmitool itself has bugs when doing a reset.
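
A note on the number itself: 256 is what an exit code of 1 looks like if the logged rc is the raw wait() status, where the exit code sits in the high byte. Assuming that is what heartbeat logs here (not verified against the source), the plugin returned 1 rather than 0. A quick sketch, with a made-up function name:

```shell
#!/bin/sh
# Decode a raw wait()-style status into the exit code of the child.
# Assumes a normal exit (not killed by a signal), so the exit code
# is stored in bits 8-15 of the status word.
exit_code_from_status() {
    echo $(( ($1 >> 8) & 255 ))
}

exit_code_from_status 256   # prints 1
```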


Brock, can I ask for your machine type and model? I have seen some
non-zero return values when using ipmitool reset on some HP ProLiant
DL140/145 servers.


Regards,

Chun Tian (binghe)

I have had some luck getting STONITH to work, but I am still running into
a problem I cannot figure out how to debug.


In the ha.cf  on each host I have:

stonith_host mds2.engin.umich.edu external/ipmi mds2.engin.umich.edu mds2-m.engin.umich.edu root PASSWORD
stonith_host mds1.engin.umich.edu external/ipmi mds1.engin.umich.edu mds1-m.engin.umich.edu root PASSWORD


Now heartbeat does try to kill the node where I kill heartbeat.  In  
the log I see:


heartbeat[12013]: 2008/07/31_15:47:56 info: Resetting node mds1.engin.umich.edu with [IPMI STONITH device]
heartbeat[12013]: 2008/07/31_15:47:57 info: glib: external_run_cmd: Calling '/usr/lib64/stonith/plugins/external/ipmi reset mds1.engin.umich.edu' returned 256
heartbeat[12013]: 2008/07/31_15:47:57 ERROR: glib: external_reset_req: 'ipmi reset' for host mds1.engin.umich.edu failed with rc 256


I can run:

stonith -t external/ipmi -p mds1.engin.umich.edu mds1-m.engin.umich.edu root PASSWORD -T reset mds1.engin.umich.edu


and the dead node will restart.  So from the documentation of 1.x  
style configs I am not sure where to debug why the stonith_host  
lines do not work.


mds1 and mds2 are the nodes of the cluster,  mds1-m and mds2-m are  
the hostnames of the IPMI devices which have lan configs set up.  
Note how stonith from the cmd line works just fine, just not in  
heartbeat.


Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
[EMAIL PROTECTED]
(734)936-1985




Re: [Linux-HA] external/ipmi problems

2008-08-01 Thread Brock Palen
I did figure it out.  The problem is that some of the docs out there are
not very clear.  I will blog this at www.mlds-networks.com in case you
need it later, or just look in the mailing list archives.


The format is

stonith_host  HOST SENDING  external/ipmi  HOST TO CONTROL

So what I really needed:

stonith_host mds2.engin.umich.edu external/ipmi mds1.engin.umich.edu mds1-m.engin.umich.edu USER PASSWORD
stonith_host mds1.engin.umich.edu external/ipmi mds2.engin.umich.edu mds2-m.engin.umich.edu USER PASSWORD


Notice how the first host, second host and IPMI host are different.
The first one tells mds2 how to kill mds1 using the mds1-m IPMI device.

The second one tells mds1 how to kill mds2 using mds2-m etc.
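
To avoid mixing up the three hostnames, the lines can be generated rather than typed. A small sketch that follows the format described above; the function name is made up and USER/PASSWORD are placeholders:

```shell
#!/bin/sh
# Print the stonith_host line that lets node $1 reset node $2 through
# the IPMI device $3, logging in as user $4 with password $5.
stonith_line() {
    printf 'stonith_host %s external/ipmi %s %s %s %s\n' "$1" "$2" "$3" "$4" "$5"
}

# mds2 resets mds1 via mds1-m, and the other way round.
stonith_line mds2.engin.umich.edu mds1.engin.umich.edu mds1-m.engin.umich.edu USER PASSWORD
stonith_line mds1.engin.umich.edu mds2.engin.umich.edu mds2-m.engin.umich.edu USER PASSWORD
```

The first argument is always the node doing the shooting, so it must never be the same as the second.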

I hope that helps.  Good luck,

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
[EMAIL PROTECTED]
(734)936-1985



