On Mon, Sep 15, 2008 at 7:34 AM, Andrew Beekhof <[EMAIL PROTECTED]> wrote:

> Your stonith agent modifications seem to have broken something as the
> stonith agents are not able to start up (rc=6) under some conditions.


Well, apparently something is very wrong. But what are these "some
conditions"? Where can I find documentation for these conditions?
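(For reference: rc=6 in LRM/OCF terms is OCF_ERR_CONFIGURED, a "hard" error, which is why the logs below show the CRM refusing to restart the agents anywhere. A minimal sketch of the standard OCF exit-code names, assuming the stonith agents are reported through the usual OCF convention:)

```shell
# Hedged sketch: standard OCF/LRM exit-code names, for reading the
# "rc=6" in the failed-actions output below. Codes per the OCF
# resource-agent convention; the agent itself may differ.
rc_name() {
  case "$1" in
    0) echo OCF_SUCCESS ;;
    1) echo OCF_ERR_GENERIC ;;
    2) echo OCF_ERR_ARGS ;;
    3) echo OCF_ERR_UNIMPLEMENTED ;;
    4) echo OCF_ERR_PERM ;;
    5) echo OCF_ERR_INSTALLED ;;
    6) echo OCF_ERR_CONFIGURED ;;   # hard error: not retried on another node
    7) echo OCF_NOT_RUNNING ;;
    *) echo UNKNOWN ;;
  esac
}

rc_name 6
```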


>
> On Thu, Sep 11, 2008 at 19:50, Itay Donenhirsch <[EMAIL PROTECTED]>
> wrote:
> > Hi all,
> >
> > I have a problem that has been wrecking my nerves for several days now: I
> > start out from a 4-node cluster; the stations are
> > ibp1-105...ibp3-105, ibp-standby-105. All is nice and cozy until all hell
> > breaks loose, after a few ethernet cable pull-outs and re-insertions. The
> > results are not deterministic, but usually end in split-brain-like
> > situations or double allocation of resources (two stations holding the
> > same resource). Many warnings accompany this situation, but I can
> > not make much of them.
> >
> > I'm working with heartbeat 2.1.4 (no, not using pacemaker!).
> >
> > You can get the logs and all the vital stats from
> > http://itay.bazoo.org/problem.tar.gz (~400kb).  (incl. cibs, conf, logs,
> > crm_mons, crm_verify)
> >
> > *** crm_mon on ibp1-105:
> >
> > ============
> > Last updated: Thu Sep 11 20:40:28 2008
> > Current DC: ibp3-105 (534b8ee0-d476-48ff-806b-5301b2a45037)
> > 4 Nodes configured.
> > 7 Resources configured.
> > ============
> >
> > Node: ibp-standby-105 (049722ba-19df-43a8-a73f-3f3d69eb332f): standby
> > Node: ibp3-105 (534b8ee0-d476-48ff-806b-5301b2a45037): online
> >        ibp2-105_stonith:1      (stonith:external/qod-ipmi)
> >        ibp3_mgmt_ip    (ocf::heartbeat:IPaddr2)
> >        ibp3_qod_ha_process     (lsb:qod-ha)
> >        ibp1-105_stonith:2      (stonith:external/qod-ipmi)
> >        ibp-standby-105_stonith:2       (stonith:external/qod-ipmi)
> >        ibp3_data0_ip   (ocf::heartbeat:IPaddr2)
> > Node: ibp2-105 (b6a94dfe-a247-48fb-a008-556e8598f3e0): online
> >        ibp2_data0_ip   (ocf::heartbeat:IPaddr2)
> >        ibp2_qod_ha_process     (lsb:qod-ha)
> >        ibp2_mgmt_ip    (ocf::heartbeat:IPaddr2)
> >        ibp3-105_stonith:0      (stonith:external/qod-ipmi)
> > Node: ibp1-105 (f64f0bd8-e86e-40c4-8299-fc0bd2239d75): standby
> >
> > Failed actions:
> >    ibp1-105_stonith:0_start_0 (node=ibp1-105, call=30, rc=6): complete
> >    ibp1-105_stonith:1_start_0 (node=ibp1-105, call=28, rc=6): complete
> >    ibp-standby-105_stonith:0_start_0 (node=ibp-standby-105, call=26, rc=6): complete
> >    ibp-standby-105_stonith:1_start_0 (node=ibp-standby-105, call=29, rc=6): complete
> >    ibp2-105_stonith:0_start_0 (node=ibp2-105, call=24, rc=6): complete
> >    ibp2-105_stonith:2_start_0 (node=ibp2-105, call=28, rc=6): complete
> >    ibp3-105_stonith:1_start_0 (node=ibp3-105, call=24, rc=6): complete
> >    ibp3-105_stonith:2_start_0 (node=ibp3-105, call=35, rc=6): complete
> >
> > *** Endless repetitions in the logs on ibp1-105 of:
> >
> > Sep 11 19:56:35 [EMAIL PROTECTED] crm_resource: [25821]: WARN: determine_online_status: Node ibp-standby-105 is unclean
> > Sep 11 19:56:35 [EMAIL PROTECTED] crm_resource: [25821]: ERROR: native_add_running: Resource stonith::external/qod-ipmi:ibp1-105_stonith:0 appears to be active on 2 nodes.
> > Sep 11 19:56:35 [EMAIL PROTECTED] crm_resource: [25821]: ERROR: See http://linux-ha.org/v2/faq/resource_too_active for more information.
> > Sep 11 19:56:35 [EMAIL PROTECTED] crm_resource: [25821]: ERROR: unpack_rsc_op: Hard error: ibp-standby-105_stonith:0_start_0 failed with rc=6.
> > Sep 11 19:56:35 [EMAIL PROTECTED] crm_resource: [25821]: ERROR: unpack_rsc_op: Preventing ibp-standby-105_stonith:0 from re-starting anywhere in the cluster
> > Sep 11 19:56:35 [EMAIL PROTECTED] crm_resource: [25821]: WARN: unpack_rsc_op: Processing failed op ibp-standby-105_stonith:0_start_0 on ibp-standby-105: Error
> > Sep 11 19:56:35 [EMAIL PROTECTED] crm_resource: [25821]: WARN: unpack_rsc_op: Compatability handling for failed op ibp-standby-105_stonith:0_start_0 on ibp-standby-105
> > Sep 11 19:56:35 [EMAIL PROTECTED] crm_resource: [25821]: ERROR: unpack_rsc_op: Hard error: ibp-standby-105_stonith:1_start_0 failed with rc=6.
> > Sep 11 19:56:35 [EMAIL PROTECTED] crm_resource: [25821]: ERROR: unpack_rsc_op: Preventing ibp-standby-105_stonith:1 from re-starting anywhere in the cluster
> > Sep 11 19:56:35 [EMAIL PROTECTED] crm_resource: [25821]: WARN: unpack_rsc_op: Processing failed op ibp-standby-105_stonith:1_start_0 on ibp-standby-105: Error
> > Sep 11 19:56:35 [EMAIL PROTECTED] crm_resource: [25821]: WARN: unpack_rsc_op: Compatability handling for failed op ibp-standby-105_stonith:1_start_0 on ibp-standby-105
> >
> > *** And more endless repetitions in the logs on ibp-standby-105 of:
> > Sep 11 19:56:39 [EMAIL PROTECTED] pengine: [29098]: WARN: send queue maximum length(500) exceeded
> > Sep 11 19:56:39 [EMAIL PROTECTED] pengine: [29098]: WARN: send queue maximum length(500) exceeded
> > Sep 11 19:56:39 [EMAIL PROTECTED] pengine: [29098]: WARN: send queue maximum length(500) exceeded
> > Sep 11 19:56:39 [EMAIL PROTECTED] pengine: [29098]: WARN: send queue maximum length(500) exceeded
> > Sep 11 19:56:39 [EMAIL PROTECTED] pengine: [29098]: WARN: send queue maximum length(500) exceeded
> >
> > Another thing (maybe it's related?) that I noticed: I'm using Dells with
> > DRAC5 and trying to stonith them with external/ipmi. This seems to
> > fail when the stations are powered down, which violates rule 4
> > according to http://www.linux-ha.org/STONITH. Therefore I made a variant
> > of the script and named it "qod-ipmi"; it is attached in the mentioned
> > tarball as well.
> >
> > I'll be grateful for any advice, as soon as possible I hope...
> >
> > Thanks,
> > Itay
> > _______________________________________________
> > Linux-HA mailing list
> > [email protected]
> > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > See also: http://linux-ha.org/ReportingProblems
> >
