I've just realised I have a classic split-brain in my CRM setup. I'm running pacemaker 1.1.6-2ubuntu0~ppa2 (installed from ubuntu-ha-maintainers-ppa-lucid) and heartbeat 1:3.0.5-3ubuntu0~ppa1 on Ubuntu Lucid, with 3 IPaddr2, 3 SendArp and 3 MailTo resources across two servers (front ends running haproxy). This was all working fine, but when I checked crm_mon today I found that each node shows the other as offline, and both are publishing the same floating IPs simultaneously! Weirdly, everything still seems to be working. I can't see any reason for this: nothing in the config has changed since it last worked, both servers are up and running, and the firewall ports are open (each node allows UDP on port 694 from the other machine). crm_mon shows this:
============
Last updated: Tue Jun 26 16:28:23 2012
Last change: Tue Mar 27 22:19:17 2012
Stack: Heartbeat
Current DC: proxy1.example.com (68890308-615b-4b28-bb8b-5aa00bdbf65c) - partition with quorum
Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
2 Nodes configured, 1 expected votes
10 Resources configured.
============

Online: [ proxy1.example.com ]
OFFLINE: [ proxy2.example.com ]

 Resource Group: proxyfloat
     ip1        (ocf::heartbeat:IPaddr2):      Started proxy1.example.com
     ip1arp     (ocf::heartbeat:SendArp):      Started proxy1.example.com
     ip1email   (ocf::heartbeat:MailTo):       Started proxy1.example.com
 Resource Group: proxyfloat2
     ip2        (ocf::heartbeat:IPaddr2):      Started proxy1.example.com
     ip2arp     (ocf::heartbeat:SendArp):      Started proxy1.example.com
     ip2email   (ocf::heartbeat:MailTo):       Started proxy1.example.com
 Resource Group: proxyfloat3
     ip3        (ocf::heartbeat:IPaddr2):      Started proxy1.example.com
     ip3arp     (ocf::heartbeat:SendArp):      Started proxy1.example.com
     ip3email   (ocf::heartbeat:MailTo):       Started proxy1.example.com

And on proxy2:

============
Last updated: Tue Jun 26 16:28:09 2012
Last change: Tue Mar 27 22:19:17 2012
Stack: Heartbeat
Current DC: proxy2.example.com (30a5636b-26f6-4c31-9ea7-d4fb912ee624) - partition with quorum
Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
2 Nodes configured, 1 expected votes
10 Resources configured.
============

Online: [ proxy2.example.com ]
OFFLINE: [ proxy1.example.com ]

 Resource Group: proxyfloat
     ip1        (ocf::heartbeat:IPaddr2):      Started proxy2.example.com
     ip1arp     (ocf::heartbeat:SendArp):      Started proxy2.example.com
     ip1email   (ocf::heartbeat:MailTo):       Started proxy2.example.com
 Resource Group: proxyfloat2
     ip2        (ocf::heartbeat:IPaddr2):      Started proxy2.example.com
     ip2arp     (ocf::heartbeat:SendArp):      Started proxy2.example.com
     ip2email   (ocf::heartbeat:MailTo):       Started proxy2.example.com
 Resource Group: proxyfloat3
     ip3        (ocf::heartbeat:IPaddr2):      Started proxy2.example.com
     ip3arp     (ocf::heartbeat:SendArp):      Started proxy2.example.com
     ip3email   (ocf::heartbeat:MailTo):       Started proxy2.example.com

Both servers are logging this sequence every 10 minutes or so:

Jun 26 06:44:18 proxy1 crmd: [3205]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (900000ms)
Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_state_transition: All 1 cluster nodes are eligible to run resources.
Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_pe_invoke: Query 1746: Requesting the current CIB: S_POLICY_ENGINE
Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_pe_invoke_callback: Invoking the PE: query=1746, ref=pe_calc-dc-1340693058-1731, seq=3, quorate=1
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: unpack_config: On loss of CCM Quorum: Ignore
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: unpack_rsc_op: Operation ip2arp_last_failure_0 found resource ip2arp active on proxy1.example.com
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: unpack_rsc_op: Operation ip1arp_last_failure_0 found resource ip1arp active on proxy1.example.com
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: unpack_rsc_op: Operation ip3_last_failure_0 found resource ip3 active on proxy1.example.com
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: unpack_rsc_op: Operation ip3arp_last_failure_0 found resource ip3arp active on proxy1.example.com
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave email_alert#011(Stopped)
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave ip1#011(Started proxy1.example.com)
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave ip1arp#011(Started proxy1.example.com)
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave ip1email#011(Started proxy1.example.com)
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave ip2#011(Started proxy1.example.com)
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave ip2arp#011(Started proxy1.example.com)
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave ip2email#011(Started proxy1.example.com)
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave ip3#011(Started proxy1.example.com)
Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave ip3arp#011(Started proxy1.example.com)
Jun 26 06:44:18 proxy1 crmd: [3205]: info: unpack_graph: Unpacked transition 1653: 0 actions in 0 synapses
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave ip3email#011(Started proxy1.example.com)
Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_te_invoke: Processing graph 1653 (ref=pe_calc-dc-1340693058-1731) derived from /var/lib/pengine/pe-input-35.bz2
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: process_pe_message: Transition 1653: PEngine Input stored in: /var/lib/pengine/pe-input-35.bz2
Jun 26 06:44:18 proxy1 crmd: [3205]: info: run_graph: ====================================================
Jun 26 06:44:18 proxy1 crmd: [3205]: notice: run_graph: Transition 1653 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-35.bz2): Complete
Jun 26 06:44:18 proxy1 crmd: [3205]: info: te_graph_trigger: Transition 1653 is now complete
Jun 26 06:44:18 proxy1 crmd: [3205]: info: notify_crmd: Transition 1653 status: done - <null>
Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_state_transition: Starting PEngine Recheck Timer

How can I diagnose why they are not talking to each other?

Marcus

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
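PS: in case it helps rule things out, this is the quick probe I used to convince myself that UDP datagrams do get through between the boxes in both directions. It's just a sketch: the 9694 scratch port is my own placeholder (heartbeat itself uses 694, which is already bound, so I test a neighbouring port covered by the same firewall rules), and you'd run the listener on one node and the probe on the other with the peer's real IP.

```python
# Minimal UDP echo probe to check datagram reachability between two hosts.
# Sketch only: run udp_listen_once() on one node, udp_probe() on the other.
# Port 9694 and the addresses are placeholders, not heartbeat's own traffic.
import socket

def udp_listen_once(bind_host, port):
    """Receive one datagram and echo it back to the sender (run on the peer)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.bind((bind_host, port))
        data, addr = s.recvfrom(1024)
        s.sendto(data, addr)

def udp_probe(host, port, payload=b"ha-probe", timeout=2.0):
    """Send one UDP datagram and wait for the echo.

    Returns the reply bytes, or None if nothing came back before the
    timeout (i.e. the datagram or its reply was dropped somewhere).
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(payload, (host, port))
        try:
            data, _addr = s.recvfrom(1024)
            return data
        except socket.timeout:
            return None
```

On the peer run udp_listen_once("0.0.0.0", 9694), then from this node call udp_probe("<peer-ip>", 9694): getting the payload back means UDP passes both ways through the firewall, while None points at a drop on the path rather than a heartbeat/pacemaker problem.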