Can you show me your /etc/cluster/cluster.conf?

I ask because I think your problem is a fencing loop.
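
For reference, a two-node cman+pacemaker cluster.conf typically looks roughly
like the sketch below, with fencing redirected to Pacemaker through fence_pcmk.
Node names, config_version and the post_join_delay value here are placeholders,
so adjust to your setup; this is where a fencing loop usually shows up (the
fence methods, post_join_delay, and the two_node settings):

<?xml version="1.0"?>
<cluster config_version="1" name="example_cluster">
  <!-- two_node="1" lets cman keep quorum with a single surviving node -->
  <cman two_node="1" expected_votes="1"/>
  <!-- give a rebooted node time to rejoin before fenced shoots it again -->
  <fence_daemon post_join_delay="60"/>
  <clusternodes>
    <clusternode name="node1.example.com" nodeid="1">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="node1.example.com"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node2.example.com" nodeid="2">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="node2.example.com"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <!-- fence_pcmk hands cman/dlm fencing requests over to Pacemaker's stonith -->
    <fencedevice name="pcmk" agent="fence_pcmk"/>
  </fencedevices>
</cluster>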

On 1 March 2012 01:03, William Seligman <selig...@nevis.columbia.edu> wrote:

> On 2/28/12 7:26 PM, Lars Ellenberg wrote:
> > On Tue, Feb 28, 2012 at 03:51:29PM -0500, William Seligman wrote:
> >> <off-topic>
> >> Sigh. I wish that were the reason.
> >>
> >> The reason why I'm doing dual-primary is that I've got a single-primary
> >> two-node cluster in production that simply doesn't work. One node runs
> >> resources; the other sits and twiddles its fingers; fine. But when the
> >> primary goes down, the secondary has trouble starting up all the
> >> resources; when we've actually had primary failures (UPS goes haywire,
> >> hard drive failure) the secondary often winds up in a state in which it
> >> runs none of the significant resources.
> >>
> >> With the dual-primary setup I have now, both machines are running the
> >> resources that typically cause problems in my single-primary
> >> configuration. If one box goes down, the other doesn't have to fail over
> >> anything; it's already running them. (I needed IPaddr2 cloning to work
> >> properly for this to work, which is why I started that thread... and all
> >> the stupider of me for missing that crucial page in Clusters From
> >> Scratch.)
> >>
> >> My only remaining problem with the configuration is restoring a fenced
> >> node to the cluster. Hence my tests, and the reason why I started this
> >> thread.
> >> </off-topic>
> >
> > Uhm, I do think that is exactly on topic.
> >
> > Rather fix your resources to be able to successfully take over,
> > than add even more complexity.
> >
> > What resources would that be,
> > and why are they not taking over?
>
> I can't tell you in detail, because the major snafu happened on a
> production system after a power outage a few months ago. My goal was to
> get the thing stable as quickly as possible. In the end, that turned out
> to be a non-HA configuration: one node runs corosync+pacemaker+drbd,
> while the other just runs drbd. It works, in the sense that the users
> get their e-mail. If there's a power outage, I have to bring things up
> manually.
>
> So my only reference is the test-bench dual-primary setup I've got now,
> which is exhibiting the same kinds of problems even though the OS
> versions, software versions, and layout are different. This suggests
> that the problem lies in the way I'm setting up the configuration.
>
> The problems I have seem to be in the general category of "the 'good
> guy' gets fenced when the 'bad guy' gets into trouble." Examples:
>
> - Assume I start out with two crashed nodes. If I just start up DRBD
> and nothing else, the partitions sync quickly with no problems.
>
> - If the system starts with cman running, and I start drbd, it's likely
> that the system which is _not_ Outdated will be fenced (rebooted). Same
> thing if cman+pacemaker is running.
>
> - Cloned ocf:heartbeat:exportfs resources are giving me problems as well
> (which is why I tried making changes to that resource script). Assume I
> start with one node running cman+pacemaker, and the other stopped. I
> turn on the stopped node. This will typically result in the running node
> being fenced, because it times out when stopping the exportfs resource.
>
> Falling back to DRBD 8.3.12 didn't change this behavior.
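
One thing worth checking here: is DRBD-level fencing wired into Pacemaker? For
dual-primary under Pacemaker, the usually recommended settings look roughly
like the sketch below (resource name "admin" taken from your excerpt; the
handler paths may differ on your distribution):

resource admin {
  net {
    # both nodes may be Primary at the same time
    allow-two-primaries;
  }
  disk {
    # on loss of the replication link, suspend I/O and ask the cluster to fence
    fencing resource-and-stonith;
  }
  handlers {
    # add a Pacemaker constraint so the outdated peer cannot be promoted...
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    # ...and remove it again once the resync has finished
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}

Without something like that, DRBD's view of who is Outdated and the cluster's
fencing decisions are not connected, which can look a lot like what you
describe.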
>
> My pacemaker configuration is long, so I'll excerpt what I think are the
> relevant pieces in the hope that it will be enough for someone to say
> "You fool! This is covered in Pacemaker Explained page 56!" When
> bringing up a stopped node, in order to restart AdminClone, Pacemaker
> wants to stop ExportsClone, then Gfs2Clone, then ClvmdClone. As I said,
> it's the failure to stop ExportMail on the running node that causes it
> to be fenced.
>
> primitive AdminDrbd ocf:linbit:drbd \
>        params drbd_resource="admin" \
>        op monitor interval="60s" role="Master" \
>        op monitor interval="59s" role="Slave" \
>        op stop interval="0" timeout="320" \
>        op start interval="0" timeout="240"
> ms AdminClone AdminDrbd \
>        meta master-max="2" master-node-max="1" \
>        clone-max="2" clone-node-max="1" notify="true"
>
> primitive Clvmd lsb:clvmd op monitor interval="30s"
> clone ClvmdClone Clvmd
> colocation Clvmd_With_Admin inf: ClvmdClone AdminClone:Master
> order Admin_Before_Clvmd inf: AdminClone:promote ClvmdClone:start
>
> primitive Gfs2 lsb:gfs2 op monitor interval="30s"
> clone Gfs2Clone Gfs2
> colocation Gfs2_With_Clvmd inf: Gfs2Clone ClvmdClone
> order Clvmd_Before_Gfs2 inf: ClvmdClone Gfs2Clone
>
> primitive ExportMail ocf:heartbeat:exportfs \
>        op start interval="0" timeout="40" \
>        op stop interval="0" timeout="45" \
>        params clientspec="mail" directory="/mail" fsid="30"
> clone ExportsClone ExportMail
> colocation Exports_With_Gfs2 inf: ExportsClone Gfs2Clone
> order Gfs2_Before_Exports inf: Gfs2Clone ExportsClone
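
One more thing I would check in the excerpt above: the clones have no
interleave set. By default clones are not interleaved, so ordering constraints
between two clones are evaluated cluster-wide, and starting the instances on a
rejoining node can force a stop/start cycle of the dependent instances on the
node that never went away; if that stop times out (your ExportMail case), the
healthy node gets fenced. A sketch of what I mean, in crm shell syntax:

clone ClvmdClone Clvmd \
        meta interleave="true"
clone Gfs2Clone Gfs2 \
        meta interleave="true"
clone ExportsClone ExportMail \
        meta interleave="true"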
>
> --
> Bill Seligman             | Phone: (914) 591-2823
> Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu
> PO Box 137                |
> Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/
>



-- 
this is my life and I live it for as long as God wills
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
