Try changing the fence_daemon tag like this:
====================================
<fence_daemon clean_start="1" post_join_delay="30" />
====================================
Then increment your cluster config version and reboot the cluster.
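For example, against the cluster.conf quoted below this would look roughly like the
following sketch (the config_version value of 18 is only an illustration; anything
higher than the current 17 works, and the same file has to end up on both nodes):

====================================
<?xml version="1.0"?>
<!-- config_version must be bumped (17 -> 18 here, as an example) so the
     nodes treat this as a newer configuration -->
<cluster config_version="18" name="Nevis_HA">
  <!-- ... nodes and fence devices unchanged ... -->

  <!-- clean_start="1": fenced assumes all nodes start out clean and skips
       startup fencing; post_join_delay="30": wait 30 seconds after a node
       joins the fence domain before fencing any failed node -->
  <fence_daemon clean_start="1" post_join_delay="30" />
  <rm disabled="1" />
</cluster>
====================================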
On 1 March 2012 12:28, William Seligman <selig...@nevis.columbia.edu> wrote:

> On 3/1/12 4:15 AM, emmanuel segura wrote:
>> can you show me your /etc/cluster/cluster.conf?
>>
>> because I think your problem is a fencing loop
>
> Here it is:
>
> /etc/cluster/cluster.conf:
>
> <?xml version="1.0"?>
> <cluster config_version="17" name="Nevis_HA">
>   <logging debug="off"/>
>   <cman expected_votes="1" two_node="1" />
>   <clusternodes>
>     <clusternode name="hypatia-tb.nevis.columbia.edu" nodeid="1">
>       <altname name="hypatia-private.nevis.columbia.edu" port="5405"
>                mcast="226.94.1.1"/>
>       <fence>
>         <method name="pcmk-redirect">
>           <device name="pcmk" port="hypatia-tb.nevis.columbia.edu"/>
>         </method>
>       </fence>
>     </clusternode>
>     <clusternode name="orestes-tb.nevis.columbia.edu" nodeid="2">
>       <altname name="orestes-private.nevis.columbia.edu" port="5405"
>                mcast="226.94.1.1"/>
>       <fence>
>         <method name="pcmk-redirect">
>           <device name="pcmk" port="orestes-tb.nevis.columbia.edu"/>
>         </method>
>       </fence>
>     </clusternode>
>   </clusternodes>
>   <fencedevices>
>     <fencedevice name="pcmk" agent="fence_pcmk"/>
>   </fencedevices>
>   <fence_daemon post_join_delay="30" />
>   <rm disabled="1" />
> </cluster>
>
>> On 1 March 2012 01:03, William Seligman <seligman@nevis.columbia.edu> wrote:
>>> On 2/28/12 7:26 PM, Lars Ellenberg wrote:
>>>> On Tue, Feb 28, 2012 at 03:51:29PM -0500, William Seligman wrote:
>>>>> <off-topic>
>>>>> Sigh. I wish that were the reason.
>>>>>
>>>>> The reason why I'm doing dual-primary is that I've got a single-primary
>>>>> two-node cluster in production that simply doesn't work. One node runs
>>>>> resources; the other sits and twiddles its fingers; fine. But when primary
>>>>> goes down, secondary has trouble starting up all the resources; when we've
>>>>> actually had primary failures (UPS goes haywire, hard drive failure) the
>>>>> secondary often winds up in a state in which it runs none of the
>>>>> significant resources.
>>>>>
>>>>> With the dual-primary setup I have now, both machines are running the
>>>>> resources that typically cause problems in my single-primary
>>>>> configuration. If one box goes down, the other doesn't have to fail over
>>>>> anything; it's already running them. (I needed IPaddr2 cloning to work
>>>>> properly for this to work, which is why I started that thread... and all
>>>>> the stupider of me for missing that crucial page in Clusters From
>>>>> Scratch.)
>>>>>
>>>>> My only remaining problem with the configuration is restoring a fenced
>>>>> node to the cluster. Hence my tests, and the reason why I started this
>>>>> thread.
>>>>> </off-topic>
>>>>
>>>> Uhm, I do think that is exactly on topic.
>>>>
>>>> Rather fix your resources to be able to successfully take over,
>>>> than add even more complexity.
>>>>
>>>> What resources would that be,
>>>> and why are they not taking over?
>>>
>>> I can't tell you in detail, because the major snafu happened on a production
>>> system after a power outage a few months ago.
>>> My goal was to get the thing stable as quickly as possible. In the end, that
>>> turned out to be a non-HA configuration: one node runs corosync+pacemaker+drbd,
>>> while the other just runs drbd. It works, in the sense that the users get
>>> their e-mail. If there's a power outage, I have to bring things up manually.
>>>
>>> So my only reference is the test-bench dual-primary setup I've got now, which
>>> is exhibiting the same kinds of problems even though the OS versions, software
>>> versions, and layout are different. This suggests that the problem lies in the
>>> way I'm setting up the configuration.
>>>
>>> The problems I have seem to be in the general category of "the 'good guy'
>>> gets fenced when the 'bad guy' gets into trouble." Examples:
>>>
>>> - Assuming I start out with two crashed nodes: if I just start up DRBD and
>>>   nothing else, the partitions sync quickly with no problems.
>>>
>>> - If the system starts with cman running, and I start drbd, it's likely that
>>>   the system which is _not_ Outdated will be fenced (rebooted). Same thing if
>>>   cman+pacemaker is running.
>>>
>>> - Cloned ocf:heartbeat:exportfs resources are giving me problems as well
>>>   (which is why I tried making changes to that resource script). Assume I
>>>   start with one node running cman+pacemaker, and the other stopped. I turn
>>>   on the stopped node. This will typically result in the running node being
>>>   fenced, because it times out when stopping the exportfs resource.
>>>
>>> Falling back to DRBD 8.3.12 didn't change this behavior.
>>>
>>> My pacemaker configuration is long, so I'll excerpt what I think are the
>>> relevant pieces, in the hope that it will be enough for someone to say "You
>>> fool! This is covered in Pacemaker Explained, page 56!" When bringing up a
>>> stopped node, in order to restart AdminClone, pacemaker wants to stop
>>> ExportsClone, then Gfs2Clone, then ClvmdClone. As I said, it's the failure to
>>> stop ExportMail on the running node that causes it to be fenced.
>>>
>>> primitive AdminDrbd ocf:linbit:drbd \
>>>         params drbd_resource="admin" \
>>>         op monitor interval="60s" role="Master" \
>>>         op monitor interval="59s" role="Slave" \
>>>         op stop interval="0" timeout="320" \
>>>         op start interval="0" timeout="240"
>>> ms AdminClone AdminDrbd \
>>>         meta master-max="2" master-node-max="1" \
>>>         clone-max="2" clone-node-max="1" notify="true"
>>>
>>> primitive Clvmd lsb:clvmd op monitor interval="30s"
>>> clone ClvmdClone Clvmd
>>> colocation Clvmd_With_Admin inf: ClvmdClone AdminClone:Master
>>> order Admin_Before_Clvmd inf: AdminClone:promote ClvmdClone:start
>>>
>>> primitive Gfs2 lsb:gfs2 op monitor interval="30s"
>>> clone Gfs2Clone Gfs2
>>> colocation Gfs2_With_Clvmd inf: Gfs2Clone ClvmdClone
>>> order Clvmd_Before_Gfs2 inf: ClvmdClone Gfs2Clone
>>>
>>> primitive ExportMail ocf:heartbeat:exportfs \
>>>         op start interval="0" timeout="40" \
>>>         op stop interval="0" timeout="45" \
>>>         params clientspec="mail" directory="/mail" fsid="30"
>>> clone ExportsClone ExportMail
>>> colocation Exports_With_Gfs2 inf: ExportsClone Gfs2Clone
>>> order Gfs2_Before_Exports inf: Gfs2Clone ExportsClone
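Only as a sketch, not something proposed elsewhere in this thread: since it is the
45-second stop of ExportMail that times out and triggers the fence, one obvious
thing to try is giving that stop operation more headroom. Same crm syntax as the
excerpt above; the 120s value is an arbitrary guess, not a tested number:

====================================
# Sketch only: identical to the ExportMail primitive above except for the
# longer stop timeout, so a slow unexport is less likely to turn into a
# failed stop (which, with fencing enabled, gets the node shot).
primitive ExportMail ocf:heartbeat:exportfs \
        op start interval="0" timeout="40" \
        op stop interval="0" timeout="120" \
        params clientspec="mail" directory="/mail" fsid="30"
====================================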