I've just realised I have a classic split-brain in my CRM setup. I'm running pacemaker 1.1.6-2ubuntu0~ppa2 (installed from ubuntu-ha-maintainers-ppa-lucid) and heartbeat 1:3.0.5-3ubuntu0~ppa1 on Ubuntu Lucid, with 3 IPaddr2, 3 SendArp and 3 MailTo resources across two servers (front ends running haproxy). This was all working fine, but when I checked crm_mon today I found that each node shows the other as offline, and both are publishing the same floating IPs simultaneously! Weirdly, everything still seems to be working. I can't see any reason for this: nothing in the config has changed since it last worked, both servers are up and running, and the firewall ports are open (each node allows UDP on port 694 from the other machine). crm_mon shows this:
============
Last updated: Tue Jun 26 16:28:23 2012
Last change: Tue Mar 27 22:19:17 2012
Stack: Heartbeat
Current DC: proxy1.example.com (68890308-615b-4b28-bb8b-5aa00bdbf65c) - partition with quorum
Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
2 Nodes configured, 1 expected votes
10 Resources configured.
============

Online: [ proxy1.example.com ]
OFFLINE: [ proxy2.example.com ]

 Resource Group: proxyfloat
     ip1        (ocf::heartbeat:IPaddr2):      Started proxy1.example.com
     ip1arp     (ocf::heartbeat:SendArp):      Started proxy1.example.com
     ip1email   (ocf::heartbeat:MailTo):       Started proxy1.example.com
 Resource Group: proxyfloat2
     ip2        (ocf::heartbeat:IPaddr2):      Started proxy1.example.com
     ip2arp     (ocf::heartbeat:SendArp):      Started proxy1.example.com
     ip2email   (ocf::heartbeat:MailTo):       Started proxy1.example.com
 Resource Group: proxyfloat3
     ip3        (ocf::heartbeat:IPaddr2):      Started proxy1.example.com
     ip3arp     (ocf::heartbeat:SendArp):      Started proxy1.example.com
     ip3email   (ocf::heartbeat:MailTo):       Started proxy1.example.com

And on proxy2:

============
Last updated: Tue Jun 26 16:28:09 2012
Last change: Tue Mar 27 22:19:17 2012
Stack: Heartbeat
Current DC: proxy2.example.com (30a5636b-26f6-4c31-9ea7-d4fb912ee624) - partition with quorum
Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
2 Nodes configured, 1 expected votes
10 Resources configured.
============

Online: [ proxy2.example.com ]
OFFLINE: [ proxy1.example.com ]

 Resource Group: proxyfloat
     ip1        (ocf::heartbeat:IPaddr2):      Started proxy2.example.com
     ip1arp     (ocf::heartbeat:SendArp):      Started proxy2.example.com
     ip1email   (ocf::heartbeat:MailTo):       Started proxy2.example.com
 Resource Group: proxyfloat2
     ip2        (ocf::heartbeat:IPaddr2):      Started proxy2.example.com
     ip2arp     (ocf::heartbeat:SendArp):      Started proxy2.example.com
     ip2email   (ocf::heartbeat:MailTo):       Started proxy2.example.com
 Resource Group: proxyfloat3
     ip3        (ocf::heartbeat:IPaddr2):      Started proxy2.example.com
     ip3arp     (ocf::heartbeat:SendArp):      Started proxy2.example.com
     ip3email   (ocf::heartbeat:MailTo):       Started proxy2.example.com

Both servers are logging this sequence every 10 minutes or so:

Jun 26 06:44:18 proxy1 crmd: [3205]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (900000ms)
Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_state_transition: All 1 cluster nodes are eligible to run resources.
Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_pe_invoke: Query 1746: Requesting the current CIB: S_POLICY_ENGINE
Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_pe_invoke_callback: Invoking the PE: query=1746, ref=pe_calc-dc-1340693058-1731, seq=3, quorate=1
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: unpack_config: On loss of CCM Quorum: Ignore
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: unpack_rsc_op: Operation ip2arp_last_failure_0 found resource ip2arp active on proxy1.example.com
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: unpack_rsc_op: Operation ip1arp_last_failure_0 found resource ip1arp active on proxy1.example.com
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: unpack_rsc_op: Operation ip3_last_failure_0 found resource ip3 active on proxy1.example.com
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: unpack_rsc_op: Operation ip3arp_last_failure_0 found resource ip3arp active on proxy1.example.com
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave email_alert#011(Stopped)
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave ip1#011(Started proxy1.example.com)
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave ip1arp#011(Started proxy1.example.com)
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave ip1email#011(Started proxy1.example.com)
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave ip2#011(Started proxy1.example.com)
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave ip2arp#011(Started proxy1.example.com)
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave ip2email#011(Started proxy1.example.com)
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave ip3#011(Started proxy1.example.com)
Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave ip3arp#011(Started proxy1.example.com)
Jun 26 06:44:18 proxy1 crmd: [3205]: info: unpack_graph: Unpacked transition 1653: 0 actions in 0 synapses
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave ip3email#011(Started proxy1.example.com)
Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_te_invoke: Processing graph 1653 (ref=pe_calc-dc-1340693058-1731) derived from /var/lib/pengine/pe-input-35.bz2
Jun 26 06:44:18 proxy1 pengine: [3207]: notice: process_pe_message: Transition 1653: PEngine Input stored in: /var/lib/pengine/pe-input-35.bz2
Jun 26 06:44:18 proxy1 crmd: [3205]: info: run_graph: ====================================================
Jun 26 06:44:18 proxy1 crmd: [3205]: notice: run_graph: Transition 1653 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-35.bz2): Complete
Jun 26 06:44:18 proxy1 crmd: [3205]: info: te_graph_trigger: Transition 1653 is now complete
Jun 26 06:44:18 proxy1 crmd: [3205]: info: notify_crmd: Transition 1653 status: done - <null>
Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_state_transition: Starting PEngine Recheck Timer

How can I diagnose why they are not talking to each other?

Marcus

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
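PS: in case it helps rule things out, this is the quick probe I used to convince myself that UDP datagrams do get through between the boxes in both directions. It's just a sketch: the 9694 scratch port is my own placeholder (heartbeat itself uses 694, which is already bound, so I test a neighbouring port covered by the same firewall rules), and you'd run the listener on one node and the probe on the other with the peer's real IP.

```python
# Minimal UDP echo probe to check datagram reachability between two hosts.
# Sketch only: run udp_listen_once() on one node, udp_probe() on the other.
# Port 9694 and the addresses are placeholders, not heartbeat's own traffic.
import socket

def udp_listen_once(bind_host, port):
    """Receive one datagram and echo it back to the sender (run on the peer)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.bind((bind_host, port))
        data, addr = s.recvfrom(1024)
        s.sendto(data, addr)

def udp_probe(host, port, payload=b"ha-probe", timeout=2.0):
    """Send one UDP datagram and wait for the echo.

    Returns the reply bytes, or None if nothing came back before the
    timeout (i.e. the datagram or its reply was dropped somewhere).
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(payload, (host, port))
        try:
            data, _addr = s.recvfrom(1024)
            return data
        except socket.timeout:
            return None
```

On the peer run udp_listen_once("0.0.0.0", 9694), then from this node call udp_probe("<peer-ip>", 9694): getting the payload back means UDP passes both ways through the firewall, while None points at a drop on the path rather than a heartbeat/pacemaker problem.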