Hi all,

I have a problem that has been wrecking my nerves for several days now: I
start out from a 4-node cluster; the stations are
ibp1-105...ibp3-105 and ibp-standby-105. All is nice and cozy until all hell
breaks loose, which happens after a few ethernet cable pulls and
re-insertions. The results are not deterministic, but usually end in a
split-brain-like situation or in double allocation of resources (two
stations get the same resource). There are many warnings accompanying this
situation, but I cannot make much of them.

I'm working with heartbeat 2.1.4 (no, not using pacemaker!).

You can get the logs and all the vital stats from
http://itay.bazoo.org/problem.tar.gz (~400kb).  (incl. cibs, conf, logs,
crm_mons, crm_verify)

*** crm_mon on ibp1-105:

============
Last updated: Thu Sep 11 20:40:28 2008
Current DC: ibp3-105 (534b8ee0-d476-48ff-806b-5301b2a45037)
4 Nodes configured.
7 Resources configured.
============

Node: ibp-standby-105 (049722ba-19df-43a8-a73f-3f3d69eb332f): standby
Node: ibp3-105 (534b8ee0-d476-48ff-806b-5301b2a45037): online
        ibp2-105_stonith:1      (stonith:external/qod-ipmi)
        ibp3_mgmt_ip    (ocf::heartbeat:IPaddr2)
        ibp3_qod_ha_process     (lsb:qod-ha)
        ibp1-105_stonith:2      (stonith:external/qod-ipmi)
        ibp-standby-105_stonith:2       (stonith:external/qod-ipmi)
        ibp3_data0_ip   (ocf::heartbeat:IPaddr2)
Node: ibp2-105 (b6a94dfe-a247-48fb-a008-556e8598f3e0): online
        ibp2_data0_ip   (ocf::heartbeat:IPaddr2)
        ibp2_qod_ha_process     (lsb:qod-ha)
        ibp2_mgmt_ip    (ocf::heartbeat:IPaddr2)
        ibp3-105_stonith:0      (stonith:external/qod-ipmi)
Node: ibp1-105 (f64f0bd8-e86e-40c4-8299-fc0bd2239d75): standby

Failed actions:
    ibp1-105_stonith:0_start_0 (node=ibp1-105, call=30, rc=6): complete
    ibp1-105_stonith:1_start_0 (node=ibp1-105, call=28, rc=6): complete
    ibp-standby-105_stonith:0_start_0 (node=ibp-standby-105, call=26, rc=6): complete
    ibp-standby-105_stonith:1_start_0 (node=ibp-standby-105, call=29, rc=6): complete
    ibp2-105_stonith:0_start_0 (node=ibp2-105, call=24, rc=6): complete
    ibp2-105_stonith:2_start_0 (node=ibp2-105, call=28, rc=6): complete
    ibp3-105_stonith:1_start_0 (node=ibp3-105, call=24, rc=6): complete
    ibp3-105_stonith:2_start_0 (node=ibp3-105, call=35, rc=6): complete
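
(For reference, the numeric return codes in these failed actions follow the
OCF convention used by heartbeat's LRM, so rc=6 should correspond to
OCF_ERR_CONFIGURED, i.e. bad or missing resource parameters, a "hard" error
the policy engine will not retry. A small sketch of that mapping; the helper
name is my own, not part of heartbeat:)

```shell
# Map OCF/LRM numeric return codes to their symbolic names
# (helper name is illustrative; the code values are the OCF standard ones).
ocf_rc_name() {
  case "$1" in
    0) echo OCF_SUCCESS ;;            # action completed successfully
    1) echo OCF_ERR_GENERIC ;;        # generic "soft" failure
    2) echo OCF_ERR_ARGS ;;           # bad arguments to the agent
    3) echo OCF_ERR_UNIMPLEMENTED ;;  # action not implemented
    4) echo OCF_ERR_PERM ;;           # insufficient permissions
    5) echo OCF_ERR_INSTALLED ;;      # required component not installed
    6) echo OCF_ERR_CONFIGURED ;;     # resource mis-configured (hard error)
    7) echo OCF_NOT_RUNNING ;;        # resource is cleanly stopped
    *) echo unknown ;;
  esac
}

ocf_rc_name 6   # prints OCF_ERR_CONFIGURED
```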

*** Endless repetitions in the logs on ibp1-105 of:

Sep 11 19:56:35 [EMAIL PROTECTED] crm_resource: [25821]: WARN:
determine_online_status: Node ibp-standby-105 is unclean
Sep 11 19:56:35 [EMAIL PROTECTED] crm_resource: [25821]: ERROR:
native_add_running: Resource stonith::external/qod-ipmi:ibp1-105_stonith:0
appears to be active on 2 nodes.
Sep 11 19:56:35 [EMAIL PROTECTED] crm_resource: [25821]: ERROR: See
http://linux-ha.org/v2/faq/resource_too_active for more information.
Sep 11 19:56:35 [EMAIL PROTECTED] crm_resource: [25821]: ERROR: unpack_rsc_op:
Hard error: ibp-standby-105_stonith:0_start_0 failed with rc=6.
Sep 11 19:56:35 [EMAIL PROTECTED] crm_resource: [25821]: ERROR:
unpack_rsc_op:   Preventing ibp-standby-105_stonith:0 from re-starting
anywhere in the cluster
Sep 11 19:56:35 [EMAIL PROTECTED] crm_resource: [25821]: WARN: unpack_rsc_op:
Processing failed op ibp-standby-105_stonith:0_start_0 on ibp-standby-105:
Error
Sep 11 19:56:35 [EMAIL PROTECTED] crm_resource: [25821]: WARN: unpack_rsc_op:
Compatability handling for failed op ibp-standby-105_stonith:0_start_0 on
ibp-standby-105
Sep 11 19:56:35 [EMAIL PROTECTED] crm_resource: [25821]: ERROR: unpack_rsc_op:
Hard error: ibp-standby-105_stonith:1_start_0 failed with rc=6.
Sep 11 19:56:35 [EMAIL PROTECTED] crm_resource: [25821]: ERROR:
unpack_rsc_op:   Preventing ibp-standby-105_stonith:1 from re-starting
anywhere in the cluster
Sep 11 19:56:35 [EMAIL PROTECTED] crm_resource: [25821]: WARN: unpack_rsc_op:
Processing failed op ibp-standby-105_stonith:1_start_0 on ibp-standby-105:
Error
Sep 11 19:56:35 [EMAIL PROTECTED] crm_resource: [25821]: WARN: unpack_rsc_op:
Compatability handling for failed op ibp-standby-105_stonith:1_start_0 on
ibp-standby-105

*** And more endless repetitions in the logs on ibp-standby-105 of:
Sep 11 19:56:39 [EMAIL PROTECTED] pengine: [29098]: WARN: send queue
maximum length(500) exceeded
Sep 11 19:56:39 [EMAIL PROTECTED] pengine: [29098]: WARN: send queue
maximum length(500) exceeded
Sep 11 19:56:39 [EMAIL PROTECTED] pengine: [29098]: WARN: send queue
maximum length(500) exceeded
Sep 11 19:56:39 [EMAIL PROTECTED] pengine: [29098]: WARN: send queue
maximum length(500) exceeded
Sep 11 19:56:39 [EMAIL PROTECTED] pengine: [29098]: WARN: send queue
maximum length(500) exceeded

Another thing I noticed (maybe it's related?): I'm using Dells with
DRAC5 and trying to STONITH them with external/ipmi. This seems to fail
when the stations are powered down, which violates rule 4 according to
http://www.linux-ha.org/STONITH. I therefore made a variant of the script
and named it "qod-ipmi"; it is included in the tarball mentioned above.
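
(The gist of the change, sketched below for anyone curious; this is a
simplified illustration rather than the actual script, and the function
name is my own. The status strings are what ipmitool's
"chassis power status" reports on my DRAC5s. Rule 4 says that fencing a
host that is already powered off must be reported as success:)

```shell
# Sketch of the rule-4 fix: a power-off request against a host that is
# already off counts as a successful fence, not a failure.
power_off_ok() {
  status="$1"   # e.g. output of: ipmitool -H <ip> -U <user> -P <pass> chassis power status
  case "$status" in
    *"Chassis Power is off"*) return 0 ;;  # already off: fencing goal already met
    *"Chassis Power is on"*)  return 1 ;;  # still on: must actually power it off
    *)                        return 1 ;;  # unreachable/unknown: cannot confirm fence
  esac
}

power_off_ok "Chassis Power is off" && echo "report success to stonithd"
```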

I'll be grateful for any advice, and as soon as possible, I hope...

Thanks,
Itay
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
