Can you show me your /etc/cluster/cluster.conf? I think your problem is a fencing loop.
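A fencing loop usually shows up as the two nodes taking turns shooting each other every time the fence domain re-forms. In a two-node cman+pacemaker cluster the first things I would check are whether fencing is delegated to pacemaker via fence_pcmk and what the fence_daemon timings are. For reference, a minimal two-node cluster.conf of that style looks roughly like this (just a sketch -- the cluster and node names are made up, and the post_join_delay value is only an example, none of it taken from your setup):

    <?xml version="1.0"?>
    <cluster config_version="1" name="testcluster">
      <cman two_node="1" expected_votes="1"/>
      <fence_daemon clean_start="0" post_join_delay="30"/>
      <clusternodes>
        <clusternode name="node1" nodeid="1">
          <fence>
            <method name="pcmk-redirect">
              <device name="pcmk" port="node1"/>
            </method>
          </fence>
        </clusternode>
        <clusternode name="node2" nodeid="2">
          <fence>
            <method name="pcmk-redirect">
              <device name="pcmk" port="node2"/>
            </method>
          </fence>
        </clusternode>
      </clusternodes>
      <fencedevices>
        <fencedevice name="pcmk" agent="fence_pcmk"/>
      </fencedevices>
    </cluster>

If cluster.conf also has real fence devices of its own configured, cman and pacemaker can end up fencing independently of each other, which is one way to get such a loop.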
On 1 March 2012 01:03, William Seligman <selig...@nevis.columbia.edu> wrote:
> On 2/28/12 7:26 PM, Lars Ellenberg wrote:
> > On Tue, Feb 28, 2012 at 03:51:29PM -0500, William Seligman wrote:
> >> <off-topic>
> >> Sigh. I wish that were the reason.
> >>
> >> The reason why I'm doing dual-primary is that I've got a single-primary
> >> two-node cluster in production that simply doesn't work. One node runs
> >> resources; the other sits and twiddles its fingers; fine. But when
> >> primary goes down, secondary has trouble starting up all the resources;
> >> when we've actually had primary failures (UPS goes haywire, hard drive
> >> failure) the secondary often winds up in a state in which it runs none
> >> of the significant resources.
> >>
> >> With the dual-primary setup I have now, both machines are running the
> >> resources that typically cause problems in my single-primary
> >> configuration. If one box goes down, the other doesn't have to fail
> >> over anything; it's already running them. (I needed IPaddr2 cloning to
> >> work properly for this to work, which is why I started that thread...
> >> and all the stupider of me for missing that crucial page in Clusters
> >> From Scratch.)
> >>
> >> My only remaining problem with the configuration is restoring a fenced
> >> node to the cluster. Hence my tests, and the reason why I started this
> >> thread.
> >> </off-topic>
> >
> > Uhm, I do think that is exactly on topic.
> >
> > Rather fix your resources to be able to successfully take over,
> > than add even more complexity.
> >
> > What resources would that be,
> > and why are they not taking over?
>
> I can't tell you in detail, because the major snafu happened on a
> production system after a power outage a few months ago. My goal was to
> get the thing stable as quickly as possible. In the end, that turned out
> to be a non-HA configuration: one node runs corosync+pacemaker+drbd,
> while the other just runs drbd. It works, in the sense that the users
> get their e-mail. If there's a power outage, I have to bring things up
> manually.
>
> So my only reference is the test-bench dual-primary setup I've got now,
> which is exhibiting the same kinds of problems even though the OS
> versions, software versions, and layout are different. This suggests
> that the problem lies in the way I'm setting up the configuration.
>
> The problems I have seem to be in the general category of "the 'good
> guy' gets fenced when the 'bad guy' gets into trouble." Examples:
>
> - Assuming I start out with two crashed nodes: if I just start up DRBD
>   and nothing else, the partitions sync quickly with no problems.
>
> - If the system starts with cman running, and I start drbd, it's likely
>   that the node which is _not_ Outdated will be fenced (rebooted). Same
>   thing if cman+pacemaker is running.
>
> - Cloned ocf:heartbeat:exportfs resources are giving me problems as well
>   (which is why I tried making changes to that resource script). Assume
>   I start with one node running cman+pacemaker, and the other stopped.
>   I turn on the stopped node. This will typically result in the running
>   node being fenced, because it times out when stopping the exportfs
>   resource.
>
> Falling back to DRBD 8.3.12 didn't change this behavior.
>
> My pacemaker configuration is long, so I'll excerpt what I think are the
> relevant pieces in the hope that it will be enough for someone to say
> "You fool! This is covered in Pacemaker Explained page 56!"
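On the exportfs point: with STONITH enabled, pacemaker's default reaction to a stop operation that fails or times out is to fence that node, so the healthy node gets shot precisely because it could not stop ExportMail in time. While you track down why the stop hangs, one thing you could try -- a sketch only, the 90-second timeout and the on-fail value are just examples, not taken from your configuration -- is to give the stop a longer timeout and have pacemaker block instead of fence on a failed stop:

    primitive ExportMail ocf:heartbeat:exportfs \
            params clientspec="mail" directory="/mail" fsid="30" \
            op start interval="0" timeout="40" \
            op stop interval="0" timeout="90" on-fail="block"

on-fail="block" simply freezes the resource on that node so you can inspect it, so I would only leave it in place while testing.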
> When bringing up a stopped node, in order to restart AdminClone,
> pacemaker wants to stop ExportsClone, then Gfs2Clone, then ClvmdClone.
> As I said, it's the failure to stop ExportMail on the running node that
> causes it to be fenced.
>
> primitive AdminDrbd ocf:linbit:drbd \
>         params drbd_resource="admin" \
>         op monitor interval="60s" role="Master" \
>         op monitor interval="59s" role="Slave" \
>         op stop interval="0" timeout="320" \
>         op start interval="0" timeout="240"
> ms AdminClone AdminDrbd \
>         meta master-max="2" master-node-max="1" \
>         clone-max="2" clone-node-max="1" notify="true"
>
> primitive Clvmd lsb:clvmd op monitor interval="30s"
> clone ClvmdClone Clvmd
> colocation Clvmd_With_Admin inf: ClvmdClone AdminClone:Master
> order Admin_Before_Clvmd inf: AdminClone:promote ClvmdClone:start
>
> primitive Gfs2 lsb:gfs2 op monitor interval="30s"
> clone Gfs2Clone Gfs2
> colocation Gfs2_With_Clvmd inf: Gfs2Clone ClvmdClone
> order Clvmd_Before_Gfs2 inf: ClvmdClone Gfs2Clone
>
> primitive ExportMail ocf:heartbeat:exportfs \
>         op start interval="0" timeout="40" \
>         op stop interval="0" timeout="45" \
>         params clientspec="mail" directory="/mail" fsid="30"
> clone ExportsClone ExportMail
> colocation Exports_With_Gfs2 inf: ExportsClone Gfs2Clone
> order Gfs2_Before_Exports inf: Gfs2Clone ExportsClone
>
> --
> Bill Seligman              | Phone: (914) 591-2823
> Nevis Labs, Columbia Univ  | mailto://selig...@nevis.columbia.edu
> PO Box 137                 |
> Irvington NY 10533 USA     | http://www.nevis.columbia.edu/~seligman/

--
this is my life and I live it for as long as God wills

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems