Hi all, I have a problem that has been wrecking my nerves for several days now: I start out from a 4-node cluster; the stations are ibp1-105...ibp3-105 and ibp-standby-105. All is nice and cozy until all hell breaks loose. It happens after a few ethernet cable pull-outs and re-insertions. The results are not deterministic, but they usually end in split-brain-like situations or double allocation of resources (two stations got the same resource). There are many warnings that accompany this situation, but I can't make much of them.
I'm working with heartbeat 2.1.4 (no, not using pacemaker!). You can get the logs and all the vital stats from http://itay.bazoo.org/problem.tar.gz (~400kb), incl. cibs, conf, logs, crm_mons, crm_verify.

*** crm_mon on ibp1-105:

============
Last updated: Thu Sep 11 20:40:28 2008
Current DC: ibp3-105 (534b8ee0-d476-48ff-806b-5301b2a45037)
4 Nodes configured.
7 Resources configured.
============

Node: ibp-standby-105 (049722ba-19df-43a8-a73f-3f3d69eb332f): standby
Node: ibp3-105 (534b8ee0-d476-48ff-806b-5301b2a45037): online
        ibp2-105_stonith:1 (stonith:external/qod-ipmi)
        ibp3_mgmt_ip (ocf::heartbeat:IPaddr2)
        ibp3_qod_ha_process (lsb:qod-ha)
        ibp1-105_stonith:2 (stonith:external/qod-ipmi)
        ibp-standby-105_stonith:2 (stonith:external/qod-ipmi)
        ibp3_data0_ip (ocf::heartbeat:IPaddr2)
Node: ibp2-105 (b6a94dfe-a247-48fb-a008-556e8598f3e0): online
        ibp2_data0_ip (ocf::heartbeat:IPaddr2)
        ibp2_qod_ha_process (lsb:qod-ha)
        ibp2_mgmt_ip (ocf::heartbeat:IPaddr2)
        ibp3-105_stonith:0 (stonith:external/qod-ipmi)
Node: ibp1-105 (f64f0bd8-e86e-40c4-8299-fc0bd2239d75): standby

Failed actions:
    ibp1-105_stonith:0_start_0 (node=ibp1-105, call=30, rc=6): complete
    ibp1-105_stonith:1_start_0 (node=ibp1-105, call=28, rc=6): complete
    ibp-standby-105_stonith:0_start_0 (node=ibp-standby-105, call=26, rc=6): complete
    ibp-standby-105_stonith:1_start_0 (node=ibp-standby-105, call=29, rc=6): complete
    ibp2-105_stonith:0_start_0 (node=ibp2-105, call=24, rc=6): complete
    ibp2-105_stonith:2_start_0 (node=ibp2-105, call=28, rc=6): complete
    ibp3-105_stonith:1_start_0 (node=ibp3-105, call=24, rc=6): complete
    ibp3-105_stonith:2_start_0 (node=ibp3-105, call=35, rc=6): complete

*** Endless repetitions in the logs on ibp1-105 of:

Sep 11 19:56:35 [EMAIL PROTECTED] crm_resource: [25821]: WARN: determine_online_status: Node ibp-standby-105 is unclean
Sep 11 19:56:35 [EMAIL PROTECTED] crm_resource: [25821]: ERROR: native_add_running: Resource stonith::external/qod-ipmi:ibp1-105_stonith:0 appears to be active on 2 nodes.
Sep 11 19:56:35 [EMAIL PROTECTED] crm_resource: [25821]: ERROR: See http://linux-ha.org/v2/faq/resource_too_active for more information.
Sep 11 19:56:35 [EMAIL PROTECTED] crm_resource: [25821]: ERROR: unpack_rsc_op: Hard error: ibp-standby-105_stonith:0_start_0 failed with rc=6.
Sep 11 19:56:35 [EMAIL PROTECTED] crm_resource: [25821]: ERROR: unpack_rsc_op: Preventing ibp-standby-105_stonith:0 from re-starting anywhere in the cluster
Sep 11 19:56:35 [EMAIL PROTECTED] crm_resource: [25821]: WARN: unpack_rsc_op: Processing failed op ibp-standby-105_stonith:0_start_0 on ibp-standby-105: Error
Sep 11 19:56:35 [EMAIL PROTECTED] crm_resource: [25821]: WARN: unpack_rsc_op: Compatability handling for failed op ibp-standby-105_stonith:0_start_0 on ibp-standby-105
Sep 11 19:56:35 [EMAIL PROTECTED] crm_resource: [25821]: ERROR: unpack_rsc_op: Hard error: ibp-standby-105_stonith:1_start_0 failed with rc=6.
Sep 11 19:56:35 [EMAIL PROTECTED] crm_resource: [25821]: ERROR: unpack_rsc_op: Preventing ibp-standby-105_stonith:1 from re-starting anywhere in the cluster
Sep 11 19:56:35 [EMAIL PROTECTED] crm_resource: [25821]: WARN: unpack_rsc_op: Processing failed op ibp-standby-105_stonith:1_start_0 on ibp-standby-105: Error
Sep 11 19:56:35 [EMAIL PROTECTED] crm_resource: [25821]: WARN: unpack_rsc_op: Compatability handling for failed op ibp-standby-105_stonith:1_start_0 on ibp-standby-105

*** And more endless repetitions in the logs on ibp-standby-105 of:

Sep 11 19:56:39 [EMAIL PROTECTED] pengine: [29098]: WARN: send queue maximum length(500) exceeded
Sep 11 19:56:39 [EMAIL PROTECTED] pengine: [29098]: WARN: send queue maximum length(500) exceeded
Sep 11 19:56:39 [EMAIL PROTECTED] pengine: [29098]: WARN: send queue maximum length(500) exceeded
Sep 11 19:56:39 [EMAIL PROTECTED] pengine: [29098]: WARN: send queue maximum length(500) exceeded
Sep 11 19:56:39 [EMAIL PROTECTED] pengine: [29098]: WARN: send queue maximum length(500) exceeded

Another thing (maybe it's related?)
that I noticed: I'm using Dells with DRAC5 and trying to stonith them with external/ipmi. This seems to fail when the stations are powered down, which violates rule 4 according to http://www.linux-ha.org/STONITH. Therefore I made a variant of the script and named it "qod-ipmi"; it is included in the mentioned tarball as well.

I'll be grateful for any advice, and as soon as possible I hope...

Thanks,
Itay

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
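P.S. To clarify what my qod-ipmi variant is trying to do (this is only a sketch of the decision logic, not the actual script from the tarball; the function name is just illustrative): given the power status string that `ipmitool chassis power status` reports, it picks which chassis command a "reset" request should issue, so that a node which is already powered down counts as safely fenced instead of failing the reset.

```shell
#!/bin/sh
# Sketch only: map the ipmitool power-status line to the chassis command
# that a stonith "reset" request should run. A node that is already off
# is already safely fenced (STONITH rule 4), so we power it back on
# instead of attempting a reset that would fail.
decide_reset_action() {
    case $1 in
    *off) echo "chassis power on" ;;      # node is off: bring it back up
    *on)  echo "chassis power reset" ;;   # node is on: hard reset it
    *)    echo "unknown"; return 1 ;;     # could not determine the state
    esac
}

# The real plugin would then run: ipmitool -I lanplus -H <drac> ... $action
decide_reset_action "Chassis Power is off"    # -> chassis power on
decide_reset_action "Chassis Power is on"     # -> chassis power reset
```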
