Re: [Linux-HA] R2 Two-node apache cluster with STONITH

Bjorn Oglefjorn Mon, 02 Apr 2007 07:10:16 -0700

Any ideas as to what's going wrong here?
--BO

On 3/30/07, Bjorn Oglefjorn <[EMAIL PROTECTED]> wrote:


I've made the OCF apache RA work by editing the script's parameters for
now.  This is just testing anyway.  Attached are my configs and a tarball
of the logs from the two nodes in question.  The logs show one complete run
of heartbeat, from start to stop.  What I did during that time is as
follows:

1. Start heartbeat
2. Wait for deadtime to expire and resources to start
3. Simulate node failure on test-2 by shutting down networking (via
console)
4. Watch as STONITH fails repeatedly
5. Start networking on test-2
6. Watch cluster recover and resources move to test-1
7. Stop heartbeat

I've made some changes to my cib.xml recently.  The largest change is that
I've made my STONITH declarations normal primitives instead of clones.  The
reason is that each STONITH device needs a unique definition within the
CIB.  A DRAC is embedded in the chassis of each node and can only work on
that particular node (e.g. test-1.drac can only reset test-1.domain).  I
don't want test-1_DRAC running on the node test-1.domain, as a reset
operation would then result in suicide (is this ever desirable?).
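
To make the arrangement above concrete, here is a minimal sketch of one
per-node STONITH primitive plus a location constraint that keeps it off the
node it fences.  It assumes the Heartbeat 2.x CIB layout; the ids, the
stonith plugin name and the omitted plugin parameters are placeholders, not
taken from the attached configs.

```xml
<configuration>
  <resources>
    <!-- One primitive per DRAC. The type value is a placeholder for
         whichever stonith plugin drives the DRAC; its parameters are
         omitted here because they depend on that plugin. -->
    <primitive id="test-1_DRAC" class="stonith" type="PLACEHOLDER-drac-plugin"/>
  </resources>
  <constraints>
    <!-- Never run test-1_DRAC on test-1.domain, so the node cannot be
         asked to reset itself. -->
    <rsc_location id="test-1_DRAC_not_on_self" rsc="test-1_DRAC">
      <rule id="test-1_DRAC_not_on_self_rule" score="-INFINITY">
        <expression id="test-1_DRAC_not_on_self_expr" attribute="#uname"
                    operation="eq" value="test-1.domain"/>
      </rule>
    </rsc_location>
  </constraints>
</configuration>
```

A mirror-image pair (test-2_DRAC excluded from test-2.domain) would be
needed for the second node.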

Normal resource recovery and failover is working as expected as far as I
can tell.  I'm only having a problem with STONITH.

I'm struggling to see where I've gone wrong.  Please help me figure this
out as it's integral to several projects I have on my plate at this time.
Thanks again; you've all been great.

--BO


On 3/30/07, Bjorn Oglefjorn <[EMAIL PROTECTED]> wrote:
>
> I took a look at the apache RA, but it makes a lot of assumptions about
> the environment that are mostly untrue on Red Hat.  How can I configure
> this RA short of making changes to the script?  Can I set environment
> variables?  I tried setting what's shown in the 'meta-data' output, but
> with no luck.
>
> Thanks as always,
> --BO
>
> On 3/29/07, Alan Robertson < [EMAIL PROTECTED]> wrote:
> >
> > Bjorn Oglefjorn wrote:
> > > Thanks for the reply Dejan.  My responses are inline.
> > > --BO
> > >
> > > On 3/28/07, Dejan Muhamedagic < [EMAIL PROTECTED]> wrote:
> > >>
> > >> On Wed, Mar 28, 2007 at 11:29:35AM -0400, Bjorn Oglefjorn wrote:
> > >> > I believe I've corrected some issues, but now I'm getting more of this:
> > >> > Mar 28 11:02:37 test-1 lrmd: [22008]: ERROR: RA lsb:httpd:monitor (process
> > >> > 24472) failed to redirect stdout for its background child (daemon)
> > >> > processes. This will likely cause those processes to die mysteriously at
> > >> > some later time (terminated by signal SIGPIPE).
> > >>
> > >> Hmm, I think that this has been addressed as Alan had already
> > >> pointed out, probably after the 2.0.7 release. If you can, please
> > >> upgrade to 2.0.8.
> > >
> > >
> > > > I'd prefer to stick with the package that comes from CentOS extras (2.0.7).
> > > > I don't get this error all the time, so I'm not sure why it's happening.
> > > > Can someone give me a deeper explanation of what the lrmd doesn't like
> > > > here?
> > >
> > >> > When I attempt to move resources to another node (using crm_standby) I get
> > >> > these errors:
> > >> > Mar 28 10:56:04 test-1 crmd: [22011]: info: do_lrm_rsc_op: lrm.c Performing
> > >> > op stop on httpd (interval=0ms, key=28:66532759-6190-4321-9be3-07730b15aeae)
> > >> > Mar 28 10:56:04 test-1 lrmd: [22773]: WARN: For LSB init script, no
> > >> > additional parameters are needed.
> > >>
> > >> Can't say unless you show me this rsc definition, but it seems
> > >> like bad usage. I found one below, but that one should not cause
> > >> this problem:
> > >
> > >
> > > It's slightly different now (is provider="heartbeat" bad here?):
> > >
> > >         <primitive class="lsb" id="httpd" provider="heartbeat"
> > > type="httpd-lsb">
> > >           <operations>
> > >             <op id="httpd_mon" interval="5s" name="monitor"
> > timeout="20s"
> > > on_fail="restart"/>
> > >             <op id="httpd_start" name="start" timeout="20s"
> > > on_fail="restart" prereq="fencing"/>
> > >             <op id="httpd_stop" name="stop" timeout="20s"
> > on_fail="restart"
> > > prereq="fencing"/>
> > >           </operations>
> > >         </primitive>
> > >
> > >> > <primitive class="lsb" id="httpd" provider="heartbeat" type="httpd">
> > >> > <operations>
> > >> > <op id="httpd_status" interval="5s" name="status" timeout="20s" on_fail="fence"/>
> > >> > </operations>
> > >> > </primitive>
> > >>
> > >> One thing that looks odd is 5s interval and 20s timeout. The
> > >> timeout is probably OK, but the interval is a bit exaggerated.
> > >> What I mean is that, apart from putting extra strain on your host
> > >> which may or may not be an issue, a 5 seconds monitoring interval
> > >> won't bring you much, or, in other words, how about your response
> > >> time in case a problem occurs? Is it of the same order?
> > >
> > >
> > > > Would it make more sense to have the timeout and interval equal?  I can
> > > > see your point.
> > >
> > >> > Mar 28 10:56:04 test-1 lrmd: [22008]: ERROR: RA lsb:httpd:stop (process
> > >> > 22773) failed to redirect stdout for its background child (daemon)
> > >> > processes. This will likely cause those processes to die mysteriously at
> > >> > some later time (terminated by signal SIGPIPE).
> > >> > Mar 28 10:56:04 test-1 lrmd: [22008]: info: RA output: (httpd:stop:stdout)
> > >> > httpd (pid 22165 22164 22163 22162 22161 22160 22159 22157 22155) is
> > >> > running...
> > >> > Mar 28 10:56:04 test-1 crmd: [22011]: WARN: process_lrm_event: lrm.c LRM
> > >> > operation (44) stop_0 on httpd Error: (1) unknown error
> > >>
> > >> I'd strongly recommend that you use the OCF RA instead of your
> > >> distribution's init script. It is otherwise rather difficult to
> > >> figure out what this error means apart from the fact that the stop
> > >> op failed. I wonder why it showed up as WARN and not ERROR.
> >
> > I agree.  Also, our resource agent monitors apache much better than
> > status on the LSB init script.
> >
> >
> > --
> >     Alan Robertson <[EMAIL PROTECTED]>
> >
> > "Openness is the foundation and preservative of friendship...  Let me
> > claim from you at all times your undisguised opinions." - William
> > Wilberforce
> > _______________________________________________
> > Linux-HA mailing list
> > Linux-HA@lists.linux-ha.org
> > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > See also: http://linux-ha.org/ReportingProblems
> >
>

