Hi Andrew,
I opened Bugzilla 1572 and included as much as I could find that I felt
was relevant. It has the complete logs, the cibadmin -Q output, diffs of
the two C files that had to be changed to build, and the two OCF scripts
I'm using, which have had minor modifications made to them (mostly for
additional debug output). If there is anything else that would be
useful, let me know. In the meantime I'm going to rebuild heartbeat from
scratch on node2, since that seems to be where the problem starts.
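
For reference, the CIB dump attached to the bug is just the full live
CIB; a minimal sketch of how it can be captured (the output filename is
my own choice, nothing standard):

    # dump the complete live CIB (configuration plus status) to a file
    cibadmin -Q > cib-query.xml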

Doug

On Mon, 2007-05-07 at 10:13 +0200, Andrew Beekhof wrote:

> can you open a bug for this and include the _complete_ logs as well as
> which version you're running (as I no longer recall)
> 
> On 5/4/07, Doug Knight <[EMAIL PROTECTED]> wrote:
> > It seems the two nodes in my cluster are behaving differently from each
> > other. First, a simplified mapping of node names to use when reading
> > the attached logs:
> >
> > node1 - arc-tkincaidlx
> > node2 - arc-dknightlx
> >
> > References to "the resource group" mean the Filesystem, pgsql, and
> > IPaddr resources, which are colocated and ordered (a rough sketch of
> > those constraints is immediately below).
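> >
> > For clarity, here is roughly how that group and the constraints tying
> > it to the drbd master/slave resource might look in the CIB. The ids are
> > made up and the exact attribute names depend on the heartbeat/CIB
> > version, so treat this as illustrative only, not a copy of my cib.xml:
> >
> >   <!-- illustrative only: ids and attribute names are assumptions -->
> >   <group id="group_pgsql">
> >     <primitive id="fs_pgsql" class="ocf" provider="heartbeat" type="Filesystem"/>
> >     <primitive id="pgsql_db" class="ocf" provider="heartbeat" type="pgsql"/>
> >     <primitive id="ip_pgsql" class="ocf" provider="heartbeat" type="IPaddr"/>
> >   </group>
> >   <rsc_order id="group_after_drbd" from="group_pgsql"
> >              type="after" to="ms_drbd_7788"/>
> >   <rsc_colocation id="group_with_drbd_master" from="group_pgsql"
> >                   to="ms_drbd_7788" to_role="master" score="INFINITY"/>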
> >
> > Heartbeat shutdowns and restarts on node1, regardless of whether it is
> > DC, has active resources, etc., all perform as expected. If the resources
> > are on node1, they migrate successfully to node2. If the location
> > constraint prefers node1 (a rough sketch of it follows the steps below)
> > and node1 re-enters the cluster, all resources migrate back. It's when
> > ANY heartbeat stop, start, or restart occurs on node2 that things break.
> > For instance:
> >
> > node1 is DC, master rsc_drbd_7788:1, group active
> > node2 is slave rsc_drbd_7788:0 ONLY
> > /etc/init.d/heartbeat stop is executed on node2
> > node1 tries to execute a demote on rsc_drbd_7788:1
> > the demote fails because the group is active on node1; the Filesystem
> > resource is holding the drbd device open via its mount point
> > heartbeat continues to loop trying the demote on node1, about 9 times a
> > second
> > heartbeat on node2, where the stop was executed, loops calling
> > notify/pre/demote on rsc_drbd_7788:0, about once a second
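> >
> > For reference, the location constraint mentioned above looks roughly
> > like this (again illustrative: the ids and score are made up; only the
> > node name is real):
> >
> >   <!-- illustrative sketch of the "prefer node1" location constraint -->
> >   <rsc_location id="prefer_node1" rsc="group_pgsql">
> >     <rule id="prefer_node1_rule" score="100">
> >       <expression id="prefer_node1_expr" attribute="#uname"
> >                   operation="eq" value="arc-tkincaidlx"/>
> >     </rule>
> >   </rsc_location>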
> >
> > It takes a manual kill of heartbeat to get things back in order, and in
> > the meantime drbd goes split brain, or so it seems, judging by what I
> > have to do to manually get drbd connected again (a rough sketch of that
> > recovery follows the timeline below). So the problem is that heartbeat
> > thinks it needs to demote the master rsc_drbd_7788:1 resource, and even
> > if that were correct, it doesn't handle the group resources that are
> > dependent on it and ordered/colocated with it. The attached logs cover
> > the entire sequence of events during the shutdown of heartbeat on node2.
> > Times of significance to help in looking at the logs are:
> >
> > Node2 HB shutdown started at 14:03:31
> > Manually started killing HB on node2 at 14:05:33
> > Node2 completed HB shutdown at 14:06:03
> > Node2 Timer pop at 14:06:33
> > Node1 HB shutdown to try to alleviate looping at 14:07:51
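> >
> > The manual drbd recovery I end up doing is roughly the generic
> > split-brain procedure below. "r0" is a placeholder for the real drbd
> > resource name, the exact commands depend on the drbd version, and which
> > node's data gets discarded is a judgment call each time:
> >
> >   # on the node whose data is to be discarded (after demoting it)
> >   drbdadm secondary r0
> >   drbdadm invalidate r0    # mark local data outdated; forces a full resync
> >   drbdadm connect r0
> >
> >   # on the surviving node
> >   drbdadm connect r0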
> >
> > The logs are fairly large due to the looping (I deleted most of the
> > repeated loop entries, so if more info is needed I can provide the
> > complete logs), and I've zipped them up, so if this email exceeds the
> > list's size limit I respectfully ask the moderator to allow it through.
> >
> > Doug Knight
> > WSI, Inc.
> >
> >
> > > > > > digging into that now. If I shut down the node that does not have the
> > > > > > active resources, the following happens:
> > > > > >
> > > > > > (State: DC on active node1, running drbd master and group resources)
> > > > > > shutdown node2
> > > > > > demote attempted on node1 for drbd master,
> > > > >
> > > > > Why demote? It's the master, running on a good node.
> > > > >
> > > >
> > > > I don't know; this is what I observed. I wondered why it would do a
> > > > demote when this node is already OK.
> > > >
> > > > > > no attempt at halting groups
> > > > > > resources that depend on drbd
> > > > >
> > > > > Why should the resources be stopped? You shut down a node which
> > > > > doesn't have any resources.
> > > > >
> > > >
> >
> > truncated...
> >
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
