Try changing the fence_daemon tag like this:
====================================
<fence_daemon clean_start="1" post_join_delay="30" />
====================================
Then increment your cluster config version and reboot the cluster.
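For example, against the cluster.conf quoted below this would look roughly like the
following sketch (the config_version value of 18 is only an illustration; anything
higher than the current 17 works, and the same file has to end up on both nodes):

====================================
<?xml version="1.0"?>
<!-- config_version must be bumped (17 -> 18 here, as an example) so the
     nodes treat this as a newer configuration -->
<cluster config_version="18" name="Nevis_HA">
  <!-- ... nodes and fence devices unchanged ... -->

  <!-- clean_start="1": fenced assumes all nodes start out clean and skips
       startup fencing; post_join_delay="30": wait 30 seconds after a node
       joins the fence domain before fencing any failed node -->
  <fence_daemon clean_start="1" post_join_delay="30" />
  <rm disabled="1" />
</cluster>
====================================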
On 1 March 2012 12:28, William Seligman <selig...@nevis.columbia.edu> wrote:

> On 3/1/12 4:15 AM, emmanuel segura wrote:
>> can you show me your /etc/cluster/cluster.conf?
>>
>> because I think your problem is a fencing loop
>
> Here it is:
>
> /etc/cluster/cluster.conf:
>
> <?xml version="1.0"?>
> <cluster config_version="17" name="Nevis_HA">
>   <logging debug="off"/>
>   <cman expected_votes="1" two_node="1" />
>   <clusternodes>
>     <clusternode name="hypatia-tb.nevis.columbia.edu" nodeid="1">
>       <altname name="hypatia-private.nevis.columbia.edu" port="5405"
>                mcast="226.94.1.1"/>
>       <fence>
>         <method name="pcmk-redirect">
>           <device name="pcmk" port="hypatia-tb.nevis.columbia.edu"/>
>         </method>
>       </fence>
>     </clusternode>
>     <clusternode name="orestes-tb.nevis.columbia.edu" nodeid="2">
>       <altname name="orestes-private.nevis.columbia.edu" port="5405"
>                mcast="226.94.1.1"/>
>       <fence>
>         <method name="pcmk-redirect">
>           <device name="pcmk" port="orestes-tb.nevis.columbia.edu"/>
>         </method>
>       </fence>
>     </clusternode>
>   </clusternodes>
>   <fencedevices>
>     <fencedevice name="pcmk" agent="fence_pcmk"/>
>   </fencedevices>
>   <fence_daemon post_join_delay="30" />
>   <rm disabled="1" />
> </cluster>
>
>> On 1 March 2012 01:03, William Seligman <seligman@nevis.columbia.edu> wrote:
>>> On 2/28/12 7:26 PM, Lars Ellenberg wrote:
>>>> On Tue, Feb 28, 2012 at 03:51:29PM -0500, William Seligman wrote:
>>>>> <off-topic>
>>>>> Sigh. I wish that were the reason.
>>>>>
>>>>> The reason why I'm doing dual-primary is that I've got a single-primary
>>>>> two-node cluster in production that simply doesn't work. One node runs
>>>>> resources; the other sits and twiddles its fingers; fine. But when primary
>>>>> goes down, secondary has trouble starting up all the resources; when we've
>>>>> actually had primary failures (UPS goes haywire, hard drive failure) the
>>>>> secondary often winds up in a state in which it runs none of the
>>>>> significant resources.
>>>>>
>>>>> With the dual-primary setup I have now, both machines are running the
>>>>> resources that typically cause problems in my single-primary
>>>>> configuration. If one box goes down, the other doesn't have to fail over
>>>>> anything; it's already running them. (I needed IPaddr2 cloning to work
>>>>> properly for this to work, which is why I started that thread... and all
>>>>> the stupider of me for missing that crucial page in Clusters From
>>>>> Scratch.)
>>>>>
>>>>> My only remaining problem with the configuration is restoring a fenced
>>>>> node to the cluster. Hence my tests, and the reason why I started this
>>>>> thread.
>>>>> </off-topic>
>>>>
>>>> Uhm, I do think that is exactly on topic.
>>>>
>>>> Rather fix your resources to be able to successfully take over,
>>>> than add even more complexity.
>>>>
>>>> What resources would that be,
>>>> and why are they not taking over?
>>>
>>> I can't tell you in detail, because the major snafu happened on a production
>>> system after a power outage a few months ago.
>>> My goal was to get the thing stable as quickly as possible. In the end, that
>>> turned out to be a non-HA configuration: one node runs corosync+pacemaker+drbd,
>>> while the other just runs drbd. It works, in the sense that the users get
>>> their e-mail. If there's a power outage, I have to bring things up manually.
>>>
>>> So my only reference is the test-bench dual-primary setup I've got now, which
>>> is exhibiting the same kinds of problems even though the OS versions, software
>>> versions, and layout are different. This suggests that the problem lies in the
>>> way I'm setting up the configuration.
>>>
>>> The problems I have seem to be in the general category of "the 'good guy'
>>> gets fenced when the 'bad guy' gets into trouble." Examples:
>>>
>>> - Assuming I start out with two crashed nodes: if I just start up DRBD and
>>>   nothing else, the partitions sync quickly with no problems.
>>>
>>> - If the system starts with cman running, and I start drbd, it's likely that
>>>   the system which is _not_ Outdated will be fenced (rebooted). Same thing if
>>>   cman+pacemaker is running.
>>>
>>> - Cloned ocf:heartbeat:exportfs resources are giving me problems as well
>>>   (which is why I tried making changes to that resource script). Assume I
>>>   start with one node running cman+pacemaker, and the other stopped. I turn
>>>   on the stopped node. This will typically result in the running node being
>>>   fenced, because it times out when stopping the exportfs resource.
>>>
>>> Falling back to DRBD 8.3.12 didn't change this behavior.
>>>
>>> My pacemaker configuration is long, so I'll excerpt what I think are the
>>> relevant pieces, in the hope that it will be enough for someone to say "You
>>> fool! This is covered in Pacemaker Explained, page 56!" When bringing up a
>>> stopped node, in order to restart AdminClone, pacemaker wants to stop
>>> ExportsClone, then Gfs2Clone, then ClvmdClone. As I said, it's the failure to
>>> stop ExportMail on the running node that causes it to be fenced.
>>>
>>> primitive AdminDrbd ocf:linbit:drbd \
>>>         params drbd_resource="admin" \
>>>         op monitor interval="60s" role="Master" \
>>>         op monitor interval="59s" role="Slave" \
>>>         op stop interval="0" timeout="320" \
>>>         op start interval="0" timeout="240"
>>> ms AdminClone AdminDrbd \
>>>         meta master-max="2" master-node-max="1" \
>>>         clone-max="2" clone-node-max="1" notify="true"
>>>
>>> primitive Clvmd lsb:clvmd op monitor interval="30s"
>>> clone ClvmdClone Clvmd
>>> colocation Clvmd_With_Admin inf: ClvmdClone AdminClone:Master
>>> order Admin_Before_Clvmd inf: AdminClone:promote ClvmdClone:start
>>>
>>> primitive Gfs2 lsb:gfs2 op monitor interval="30s"
>>> clone Gfs2Clone Gfs2
>>> colocation Gfs2_With_Clvmd inf: Gfs2Clone ClvmdClone
>>> order Clvmd_Before_Gfs2 inf: ClvmdClone Gfs2Clone
>>>
>>> primitive ExportMail ocf:heartbeat:exportfs \
>>>         op start interval="0" timeout="40" \
>>>         op stop interval="0" timeout="45" \
>>>         params clientspec="mail" directory="/mail" fsid="30"
>>> clone ExportsClone ExportMail
>>> colocation Exports_With_Gfs2 inf: ExportsClone Gfs2Clone
>>> order Gfs2_Before_Exports inf: Gfs2Clone ExportsClone
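Only as a sketch, not something proposed elsewhere in this thread: since it is the
45-second stop of ExportMail that times out and triggers the fence, one obvious
thing to try is giving that stop operation more headroom. Same crm syntax as the
excerpt above; the 120s value is an arbitrary guess, not a tested number:

====================================
# Sketch only: identical to the ExportMail primitive above except for the
# longer stop timeout, so a slow unexport is less likely to turn into a
# failed stop (which, with fencing enabled, gets the node shot).
primitive ExportMail ocf:heartbeat:exportfs \
        op start interval="0" timeout="40" \
        op stop interval="0" timeout="120" \
        params clientspec="mail" directory="/mail" fsid="30"
====================================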