Ok William, if this isn't the problem, then show me your Pacemaker CIB XML:

crm configure show > OUTPUT
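If you prefer the raw XML instead of the crm shell syntax, this should work
too (just a sketch; cibadmin comes with the pacemaker CLI tools, and the
output path is only an example):

====================================
# dump the live CIB as XML
cibadmin --query > /tmp/cib.xml
====================================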
On 01 March 2012 18:10, William Seligman <selig...@nevis.columbia.edu>
wrote:

> On 3/1/12 6:34 AM, emmanuel segura wrote:
>> try to change the fence daemon tag like this
>> ====================================
>> <fence_daemon clean_start="1" post_join_delay="30" />
>> ====================================
>> bump your cluster config version, and after that reboot the cluster
>
> This did not change the behavior of the cluster. In particular, I'm
> still dealing with this:
>
>>>>> - If the system starts with cman running, and I start drbd, it's
>>>>> likely that the system that is _not_ Outdated will be fenced
>>>>> (rebooted).
>
>> On 01 March 2012 12:28, William Seligman <selig...@nevis.columbia.edu>
>> wrote:
>>> On 3/1/12 4:15 AM, emmanuel segura wrote:
>>>> can you show me your /etc/cluster/cluster.conf?
>>>>
>>>> because I think your problem is a fencing loop
>>>
>>> Here it is:
>>>
>>> /etc/cluster/cluster.conf:
>>>
>>> <?xml version="1.0"?>
>>> <cluster config_version="17" name="Nevis_HA">
>>>   <logging debug="off"/>
>>>   <cman expected_votes="1" two_node="1" />
>>>   <clusternodes>
>>>     <clusternode name="hypatia-tb.nevis.columbia.edu" nodeid="1">
>>>       <altname name="hypatia-private.nevis.columbia.edu"
>>>                port="5405" mcast="226.94.1.1"/>
>>>       <fence>
>>>         <method name="pcmk-redirect">
>>>           <device name="pcmk" port="hypatia-tb.nevis.columbia.edu"/>
>>>         </method>
>>>       </fence>
>>>     </clusternode>
>>>     <clusternode name="orestes-tb.nevis.columbia.edu" nodeid="2">
>>>       <altname name="orestes-private.nevis.columbia.edu"
>>>                port="5405" mcast="226.94.1.1"/>
>>>       <fence>
>>>         <method name="pcmk-redirect">
>>>           <device name="pcmk" port="orestes-tb.nevis.columbia.edu"/>
>>>         </method>
>>>       </fence>
>>>     </clusternode>
>>>   </clusternodes>
>>>   <fencedevices>
>>>     <fencedevice name="pcmk" agent="fence_pcmk"/>
>>>   </fencedevices>
>>>   <fence_daemon post_join_delay="30" />
>>>   <rm disabled="1" />
>>> </cluster>
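By the way, about my earlier suggestion to bump the config version: with
config_version="17" in your cluster.conf, you shouldn't need a full reboot
just to test a config change. Something like this should validate the file
and activate the new version (a rough sketch, assuming the stock cman
tools; on some versions you have to copy cluster.conf to the other node
yourself before running it):

====================================
# check the edited /etc/cluster/cluster.conf for errors
ccs_config_validate

# load the new config and propagate the bumped config_version
cman_tool version -r
====================================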
>>>> On 01 March 2012 01:03, William Seligman
>>>> <selig...@nevis.columbia.edu> wrote:
>>>>> On 2/28/12 7:26 PM, Lars Ellenberg wrote:
>>>>>> On Tue, Feb 28, 2012 at 03:51:29PM -0500, William Seligman wrote:
>>>>>>> <off-topic>
>>>>>>> Sigh. I wish that were the reason.
>>>>>>>
>>>>>>> The reason why I'm doing dual-primary is that I've got a
>>>>>>> single-primary two-node cluster in production that simply doesn't
>>>>>>> work. One node runs resources; the other sits and twiddles its
>>>>>>> fingers; fine. But when the primary goes down, the secondary has
>>>>>>> trouble starting up all the resources; when we've actually had
>>>>>>> primary failures (UPS goes haywire, hard drive failure) the
>>>>>>> secondary often winds up in a state in which it runs none of the
>>>>>>> significant resources.
>>>>>>>
>>>>>>> With the dual-primary setup I have now, both machines are running
>>>>>>> the resources that typically cause problems in my single-primary
>>>>>>> configuration. If one box goes down, the other doesn't have to
>>>>>>> fail over anything; it's already running them. (I needed IPaddr2
>>>>>>> cloning to work properly for this to work, which is why I started
>>>>>>> that thread... and all the stupider of me for missing that
>>>>>>> crucial page in Clusters From Scratch.)
>>>>>>>
>>>>>>> My only remaining problem with the configuration is restoring a
>>>>>>> fenced node to the cluster. Hence my tests, and the reason why I
>>>>>>> started this thread.
>>>>>>> </off-topic>
>>>>>>
>>>>>> Uhm, I do think that is exactly on topic.
>>>>>>
>>>>>> Rather fix your resources to be able to successfully take over,
>>>>>> than add even more complexity.
>>>>>>
>>>>>> What resources would that be,
>>>>>> and why are they not taking over?
>>>>>
>>>>> I can't tell you in detail, because the major snafu happened on a
>>>>> production system after a power outage a few months ago. My goal
>>>>> was to get the thing stable as quickly as possible. In the end,
>>>>> that turned out to be a non-HA configuration: one node runs
>>>>> corosync+pacemaker+drbd, while the other just runs drbd. It works,
>>>>> in the sense that the users get their e-mail. If there's a power
>>>>> outage, I have to bring things up manually.
>>>>>
>>>>> So my only reference is the test-bench dual-primary setup I've got
>>>>> now, which is exhibiting the same kinds of problems even though the
>>>>> OS versions, software versions, and layout are different. This
>>>>> suggests that the problem lies in the way I'm setting up the
>>>>> configuration.
>>>>>
>>>>> The problems I have seem to be in the general category of "the
>>>>> 'good guy' gets fenced when the 'bad guy' gets into trouble."
>>>>> Examples:
>>>>>
>>>>> - Assume I start out with two crashed nodes. If I just start up
>>>>>   DRBD and nothing else, the partitions sync quickly with no
>>>>>   problems.
>>>>>
>>>>> - If the system starts with cman running, and I start drbd, it's
>>>>>   likely that the system that is _not_ Outdated will be fenced
>>>>>   (rebooted). Same thing if cman+pacemaker is running.
>>>>>
>>>>> - Cloned ocf:heartbeat:exportfs resources are giving me problems as
>>>>>   well (which is why I tried making changes to that resource
>>>>>   script). Assume I start with one node running cman+pacemaker, and
>>>>>   the other stopped. I turn on the stopped node. This will
>>>>>   typically result in the running node being fenced, because it
>>>>>   times out when stopping the exportfs resource.
>>>>>
>>>>> Falling back to DRBD 8.3.12 didn't change this behavior.
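William, this "good guy gets fenced" pattern is exactly what the DRBD
fence-peer integration with pacemaker is meant to prevent. I don't know
what your drbd.conf looks like, but for dual-primary the usual pattern is
to let DRBD put a constraint in the CIB instead of racing to shoot the
peer. A sketch of what I mean, assuming the handler scripts shipped with
your drbd packages (the resource name is taken from your AdminDrbd
primitive below):

====================================
resource admin {
  disk {
    # on loss of the replication link, freeze I/O and
    # call the fence-peer handler before doing anything else
    fencing resource-and-stonith;
  }
  handlers {
    # add a CIB constraint so the outdated peer cannot be promoted
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    # drop that constraint once the peer is back in sync
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
====================================

With that in place, an Outdated node simply cannot become Master until it
has resynced, instead of the healthy peer getting rebooted.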
>>>>> My pacemaker configuration is long, so I'll excerpt what I think
>>>>> are the relevant pieces, in the hope that it will be enough for
>>>>> someone to say "You fool! This is covered in Pacemaker Explained
>>>>> page 56!" When bringing up a stopped node, in order to restart
>>>>> AdminClone, pacemaker wants to stop ExportsClone, then Gfs2Clone,
>>>>> then ClvmdClone. As I said, it's the failure to stop ExportMail on
>>>>> the running node that causes it to be fenced.
>>>>>
>>>>> primitive AdminDrbd ocf:linbit:drbd \
>>>>>   params drbd_resource="admin" \
>>>>>   op monitor interval="60s" role="Master" \
>>>>>   op monitor interval="59s" role="Slave" \
>>>>>   op stop interval="0" timeout="320" \
>>>>>   op start interval="0" timeout="240"
>>>>> ms AdminClone AdminDrbd \
>>>>>   meta master-max="2" master-node-max="1" \
>>>>>   clone-max="2" clone-node-max="1" notify="true"
>>>>>
>>>>> primitive Clvmd lsb:clvmd op monitor interval="30s"
>>>>> clone ClvmdClone Clvmd
>>>>> colocation Clvmd_With_Admin inf: ClvmdClone AdminClone:Master
>>>>> order Admin_Before_Clvmd inf: AdminClone:promote ClvmdClone:start
>>>>>
>>>>> primitive Gfs2 lsb:gfs2 op monitor interval="30s"
>>>>> clone Gfs2Clone Gfs2
>>>>> colocation Gfs2_With_Clvmd inf: Gfs2Clone ClvmdClone
>>>>> order Clvmd_Before_Gfs2 inf: ClvmdClone Gfs2Clone
>>>>>
>>>>> primitive ExportMail ocf:heartbeat:exportfs \
>>>>>   op start interval="0" timeout="40" \
>>>>>   op stop interval="0" timeout="45" \
>>>>>   params clientspec="mail" directory="/mail" fsid="30"
>>>>> clone ExportsClone ExportMail
>>>>> colocation Exports_With_Gfs2 inf: ExportsClone Gfs2Clone
>>>>> order Gfs2_Before_Exports inf: Gfs2Clone ExportsClone
>
> --
> Bill Seligman             | Phone: (914) 591-2823
> Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu
> PO Box 137                |
> Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/

--
this is my life and I live it as long as God wills

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems