Ok William, if this isn't the problem, then show me your Pacemaker CIB XML:

crm configure show > OUTPUT
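If you prefer the raw XML instead of the crm shell syntax, this should work
too (just a sketch; cibadmin comes with the pacemaker CLI tools, and the
output path is only an example):

====================================
# dump the live CIB as XML
cibadmin --query > /tmp/cib.xml
====================================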
On 01 March 2012 18:10, William Seligman <selig...@nevis.columbia.edu>
wrote:

> On 3/1/12 6:34 AM, emmanuel segura wrote:
>> try to change the fence daemon tag like this
>> ====================================
>> <fence_daemon clean_start="1" post_join_delay="30" />
>> ====================================
>> bump your cluster config version, and after that reboot the cluster
>
> This did not change the behavior of the cluster. In particular, I'm
> still dealing with this:
>
>>>>> - If the system starts with cman running, and I start drbd, it's
>>>>> likely that the system that is _not_ Outdated will be fenced
>>>>> (rebooted).
>
>> On 01 March 2012 12:28, William Seligman <selig...@nevis.columbia.edu>
>> wrote:
>>> On 3/1/12 4:15 AM, emmanuel segura wrote:
>>>> can you show me your /etc/cluster/cluster.conf?
>>>>
>>>> because I think your problem is a fencing loop
>>>
>>> Here it is:
>>>
>>> /etc/cluster/cluster.conf:
>>>
>>> <?xml version="1.0"?>
>>> <cluster config_version="17" name="Nevis_HA">
>>>   <logging debug="off"/>
>>>   <cman expected_votes="1" two_node="1" />
>>>   <clusternodes>
>>>     <clusternode name="hypatia-tb.nevis.columbia.edu" nodeid="1">
>>>       <altname name="hypatia-private.nevis.columbia.edu"
>>>                port="5405" mcast="226.94.1.1"/>
>>>       <fence>
>>>         <method name="pcmk-redirect">
>>>           <device name="pcmk" port="hypatia-tb.nevis.columbia.edu"/>
>>>         </method>
>>>       </fence>
>>>     </clusternode>
>>>     <clusternode name="orestes-tb.nevis.columbia.edu" nodeid="2">
>>>       <altname name="orestes-private.nevis.columbia.edu"
>>>                port="5405" mcast="226.94.1.1"/>
>>>       <fence>
>>>         <method name="pcmk-redirect">
>>>           <device name="pcmk" port="orestes-tb.nevis.columbia.edu"/>
>>>         </method>
>>>       </fence>
>>>     </clusternode>
>>>   </clusternodes>
>>>   <fencedevices>
>>>     <fencedevice name="pcmk" agent="fence_pcmk"/>
>>>   </fencedevices>
>>>   <fence_daemon post_join_delay="30" />
>>>   <rm disabled="1" />
>>> </cluster>
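By the way, about my earlier suggestion to bump the config version: with
config_version="17" in your cluster.conf, you shouldn't need a full reboot
just to test a config change. Something like this should validate the file
and activate the new version (a rough sketch, assuming the stock cman
tools; on some versions you have to copy cluster.conf to the other node
yourself before running it):

====================================
# check the edited /etc/cluster/cluster.conf for errors
ccs_config_validate

# load the new config and propagate the bumped config_version
cman_tool version -r
====================================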
>>>> On 01 March 2012 01:03, William Seligman
>>>> <selig...@nevis.columbia.edu> wrote:
>>>>> On 2/28/12 7:26 PM, Lars Ellenberg wrote:
>>>>>> On Tue, Feb 28, 2012 at 03:51:29PM -0500, William Seligman wrote:
>>>>>>> <off-topic>
>>>>>>> Sigh. I wish that were the reason.
>>>>>>>
>>>>>>> The reason why I'm doing dual-primary is that I've got a
>>>>>>> single-primary two-node cluster in production that simply doesn't
>>>>>>> work. One node runs resources; the other sits and twiddles its
>>>>>>> fingers; fine. But when the primary goes down, the secondary has
>>>>>>> trouble starting up all the resources; when we've actually had
>>>>>>> primary failures (UPS goes haywire, hard drive failure) the
>>>>>>> secondary often winds up in a state in which it runs none of the
>>>>>>> significant resources.
>>>>>>>
>>>>>>> With the dual-primary setup I have now, both machines are running
>>>>>>> the resources that typically cause problems in my single-primary
>>>>>>> configuration. If one box goes down, the other doesn't have to
>>>>>>> fail over anything; it's already running them. (I needed IPaddr2
>>>>>>> cloning to work properly for this to work, which is why I started
>>>>>>> that thread... and all the stupider of me for missing that
>>>>>>> crucial page in Clusters From Scratch.)
>>>>>>>
>>>>>>> My only remaining problem with the configuration is restoring a
>>>>>>> fenced node to the cluster. Hence my tests, and the reason why I
>>>>>>> started this thread.
>>>>>>> </off-topic>
>>>>>>
>>>>>> Uhm, I do think that is exactly on topic.
>>>>>>
>>>>>> Rather fix your resources to be able to successfully take over,
>>>>>> than add even more complexity.
>>>>>>
>>>>>> What resources would that be,
>>>>>> and why are they not taking over?
>>>>>
>>>>> I can't tell you in detail, because the major snafu happened on a
>>>>> production system after a power outage a few months ago. My goal
>>>>> was to get the thing stable as quickly as possible. In the end,
>>>>> that turned out to be a non-HA configuration: one node runs
>>>>> corosync+pacemaker+drbd, while the other just runs drbd. It works,
>>>>> in the sense that the users get their e-mail. If there's a power
>>>>> outage, I have to bring things up manually.
>>>>>
>>>>> So my only reference is the test-bench dual-primary setup I've got
>>>>> now, which is exhibiting the same kinds of problems even though the
>>>>> OS versions, software versions, and layout are different. This
>>>>> suggests that the problem lies in the way I'm setting up the
>>>>> configuration.
>>>>>
>>>>> The problems I have seem to be in the general category of "the
>>>>> 'good guy' gets fenced when the 'bad guy' gets into trouble."
>>>>> Examples:
>>>>>
>>>>> - Assume I start out with two crashed nodes. If I just start up
>>>>>   DRBD and nothing else, the partitions sync quickly with no
>>>>>   problems.
>>>>>
>>>>> - If the system starts with cman running, and I start drbd, it's
>>>>>   likely that the system that is _not_ Outdated will be fenced
>>>>>   (rebooted). Same thing if cman+pacemaker is running.
>>>>>
>>>>> - Cloned ocf:heartbeat:exportfs resources are giving me problems as
>>>>>   well (which is why I tried making changes to that resource
>>>>>   script). Assume I start with one node running cman+pacemaker, and
>>>>>   the other stopped. I turn on the stopped node. This will
>>>>>   typically result in the running node being fenced, because it
>>>>>   times out when stopping the exportfs resource.
>>>>>
>>>>> Falling back to DRBD 8.3.12 didn't change this behavior.
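William, this "good guy gets fenced" pattern is exactly what the DRBD
fence-peer integration with pacemaker is meant to prevent. I don't know
what your drbd.conf looks like, but for dual-primary the usual pattern is
to let DRBD put a constraint in the CIB instead of racing to shoot the
peer. A sketch of what I mean, assuming the handler scripts shipped with
your drbd packages (the resource name is taken from your AdminDrbd
primitive below):

====================================
resource admin {
  disk {
    # on loss of the replication link, freeze I/O and
    # call the fence-peer handler before doing anything else
    fencing resource-and-stonith;
  }
  handlers {
    # add a CIB constraint so the outdated peer cannot be promoted
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    # drop that constraint once the peer is back in sync
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
====================================

With that in place, an Outdated node simply cannot become Master until it
has resynced, instead of the healthy peer getting rebooted.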
>>>>> My pacemaker configuration is long, so I'll excerpt what I think
>>>>> are the relevant pieces, in the hope that it will be enough for
>>>>> someone to say "You fool! This is covered in Pacemaker Explained
>>>>> page 56!" When bringing up a stopped node, in order to restart
>>>>> AdminClone, pacemaker wants to stop ExportsClone, then Gfs2Clone,
>>>>> then ClvmdClone. As I said, it's the failure to stop ExportMail on
>>>>> the running node that causes it to be fenced.
>>>>>
>>>>> primitive AdminDrbd ocf:linbit:drbd \
>>>>>   params drbd_resource="admin" \
>>>>>   op monitor interval="60s" role="Master" \
>>>>>   op monitor interval="59s" role="Slave" \
>>>>>   op stop interval="0" timeout="320" \
>>>>>   op start interval="0" timeout="240"
>>>>> ms AdminClone AdminDrbd \
>>>>>   meta master-max="2" master-node-max="1" \
>>>>>   clone-max="2" clone-node-max="1" notify="true"
>>>>>
>>>>> primitive Clvmd lsb:clvmd op monitor interval="30s"
>>>>> clone ClvmdClone Clvmd
>>>>> colocation Clvmd_With_Admin inf: ClvmdClone AdminClone:Master
>>>>> order Admin_Before_Clvmd inf: AdminClone:promote ClvmdClone:start
>>>>>
>>>>> primitive Gfs2 lsb:gfs2 op monitor interval="30s"
>>>>> clone Gfs2Clone Gfs2
>>>>> colocation Gfs2_With_Clvmd inf: Gfs2Clone ClvmdClone
>>>>> order Clvmd_Before_Gfs2 inf: ClvmdClone Gfs2Clone
>>>>>
>>>>> primitive ExportMail ocf:heartbeat:exportfs \
>>>>>   op start interval="0" timeout="40" \
>>>>>   op stop interval="0" timeout="45" \
>>>>>   params clientspec="mail" directory="/mail" fsid="30"
>>>>> clone ExportsClone ExportMail
>>>>> colocation Exports_With_Gfs2 inf: ExportsClone Gfs2Clone
>>>>> order Gfs2_Before_Exports inf: Gfs2Clone ExportsClone
>
> --
> Bill Seligman             | Phone: (914) 591-2823
> Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu
> PO Box 137                |
> Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/

--
this is my life and I live it as long as God wills

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems