Hi,

Next I took one node offline with /etc/init.d/heartbeat stop.
With only node arsvr1 online, heartbeat tries to respawn crmd, but it exits with return code 2. Here are the logs:

Feb 10 16:37:10 arsvr1 crmd: [5251]: info: do_state_transition: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
Feb 10 16:38:11 arsvr1 crmd: [5251]: info: crm_timer_popped: Election Trigger (I_DC_TIMEOUT) just popped!
Feb 10 16:38:11 arsvr1 crmd: [5251]: WARN: do_log: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
Feb 10 16:38:11 arsvr1 crmd: [5251]: info: do_state_transition: State transition S_PENDING -> S_ELECTION [ input=I_DC_TIMEOUT cause=C_TIMER_POPPED origin=crm_timer_popped ]
Feb 10 16:38:11 arsvr1 crmd: [5251]: info: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
Feb 10 16:38:11 arsvr1 crmd: [5251]: info: do_te_control: Registering TE UUID: c173d324-3b4f-445b-850f-f3406cc116ac
Feb 10 16:38:11 arsvr1 crmd: [5251]: WARN: cib_client_add_notify_callback: Callback already present
Feb 10 16:38:11 arsvr1 crmd: [5251]: info: set_graph_functions: Setting custom graph functions
Feb 10 16:38:11 arsvr1 crmd: [5251]: info: unpack_graph: Unpacked transition -1: 0 actions in 0 synapses
Feb 10 16:38:11 arsvr1 crmd: [5251]: info: start_subsystem: Starting sub-system "pengine"
Feb 10 16:38:11 arsvr1 pengine: [5253]: info: Invoked: /usr/lib/heartbeat/pengine
Feb 10 16:38:11 arsvr1 pengine: [5253]: info: main: Starting pengine
Feb 10 16:38:14 arsvr1 crmd: [5251]: info: do_dc_takeover: Taking over DC status for this partition
Feb 10 16:38:14 arsvr1 cib: [5116]: info: cib_process_readwrite: We are now in R/W mode
Feb 10 16:38:14 arsvr1 cib: [5116]: info: cib_process_request: Operation complete: op cib_master for section 'all' (origin=local/crmd/6, version=0.298.3): ok (rc=0)
Feb 10 16:38:14 arsvr1 cib: [5116]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/7, version=0.298.3): ok (rc=0)
Feb 10 16:38:14 arsvr1 cib: [5116]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/9, version=0.298.3): ok (rc=0)
Feb 10 16:38:14 arsvr1 crmd: [5251]: info: join_make_offer: Making join offers based on membership 1
Feb 10 16:38:14 arsvr1 crmd: [5251]: info: do_dc_join_offer_all: join-1: Waiting on 1 outstanding join acks
Feb 10 16:38:14 arsvr1 crmd: [5251]: info: te_connect_stonith: Attempting connection to fencing daemon...
Feb 10 16:38:14 arsvr1 cib: [5116]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/11, version=0.298.3): ok (rc=0)
Feb 10 16:38:15 arsvr1 crmd: [5251]: info: te_connect_stonith: Connected
Feb 10 16:38:15 arsvr1 crmd: [5251]: info: config_query_callback: Checking for expired actions every 900000ms
Feb 10 16:38:15 arsvr1 crmd: [5251]: info: update_dc: Set DC to arsvr1 (3.0.1)
Feb 10 16:38:16 arsvr1 crmd: [5251]: info: do_state_transition: State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
Feb 10 16:38:16 arsvr1 crmd: [5251]: info: do_state_transition: All 1 cluster nodes responded to the join offer.
Feb 10 16:38:16 arsvr1 crmd: [5251]: info: do_dc_join_finalize: join-1: Syncing the CIB from arsvr1 to the rest of the cluster
Feb 10 16:38:16 arsvr1 cib: [5116]: info: cib_process_request: Operation complete: op cib_sync for section 'all' (origin=local/crmd/14, version=0.298.3): ok (rc=0)
Feb 10 16:38:16 arsvr1 cib: [5116]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/15, version=0.298.3): ok (rc=0)
Feb 10 16:38:16 arsvr1 crmd: [5251]: info: update_attrd: Connecting to attrd...
Feb 10 16:38:16 arsvr1 cib: [5116]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='arsvr1']/transient_attributes (origin=local/crmd/16, version=0.298.3): ok (rc=0)
Feb 10 16:38:16 arsvr1 crmd: [5251]: info: erase_xpath_callback: Deletion of "//node_state[@uname='arsvr1']/transient_attributes": ok (rc=0)
Feb 10 16:38:16 arsvr1 crmd: [5251]: info: do_dc_join_ack: join-1: Updating node state to member for arsvr1
Feb 10 16:38:16 arsvr1 cib: [5116]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='arsvr1']/lrm (origin=local/crmd/17, version=0.298.4): ok (rc=0)
Feb 10 16:38:16 arsvr1 crmd: [5251]: info: erase_xpath_callback: Deletion of "//node_state[@uname='arsvr1']/lrm": ok (rc=0)
Feb 10 16:38:16 arsvr1 crmd: [5251]: info: do_state_transition: State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [ input=I_FINALIZED cause=C_FSA_INTERNAL origin=check_join_state ]
Feb 10 16:38:16 arsvr1 crmd: [5251]: info: populate_cib_nodes_ha: Requesting the list of configured nodes
Feb 10 16:38:17 arsvr1 crmd: [5251]: WARN: get_uuid: Could not calculate UUID for arsvr2
Feb 10 16:38:17 arsvr1 crmd: [5251]: WARN: populate_cib_nodes_ha: Node arsvr2: no uuid found
Feb 10 16:38:17 arsvr1 attrd: [5119]: info: attrd_local_callback: Sending full refresh (origin=crmd)
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_state_transition: All 1 cluster nodes are eligible to run resources.
Feb 10 16:38:17 arsvr1 attrd: [5119]: info: attrd_trigger_update: Sending flush op to all hosts for: shutdown (<null>)
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_dc_join_final: Ensuring DC, quorum and node attributes are up-to-date
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: crm_update_quorum: Updating quorum status to true (call=21)
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: abort_transition_graph: do_te_invoke:191 - Triggered transition abort (complete=1) : Peer Cancelled
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_pe_invoke: Query 22: Requesting the current CIB: S_POLICY_ENGINE
Feb 10 16:38:17 arsvr1 cib: [5116]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/19, version=0.298.5): ok (rc=0)
Feb 10 16:38:17 arsvr1 cib: [5116]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/21, version=0.298.5): ok (rc=0)
Feb 10 16:38:17 arsvr1 attrd: [5119]: info: attrd_trigger_update: Sending flush op to all hosts for: terminate (<null>)
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_pe_invoke_callback: Invoking the PE: query=22, ref=pe_calc-dc-1297373897-7, seq=1, quorate=1
Feb 10 16:38:17 arsvr1 pengine: [5253]: notice: unpack_config: On loss of CCM Quorum: Ignore
Feb 10 16:38:17 arsvr1 pengine: [5253]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Feb 10 16:38:17 arsvr1 pengine: [5253]: info: determine_online_status: Node arsvr1 is online
Feb 10 16:38:17 arsvr1 pengine: [5253]: notice: group_print: Resource Group: MySQLDB
Feb 10 16:38:17 arsvr1 pengine: [5253]: notice: native_print: fs_mysql#011(ocf::heartbeat:Filesystem):#011Stopped
Feb 10 16:38:17 arsvr1 pengine: [5253]: notice: native_print: mysql#011(ocf::heartbeat:mysql):#011Stopped
Feb 10 16:38:17 arsvr1 pengine: [5253]: notice: clone_print: Master/Slave Set: ms_drbd_mysql
Feb 10 16:38:17 arsvr1 pengine: [5253]: notice: short_print: Stopped: [ drbd_mysql:0 drbd_mysql:1 ]
Feb 10 16:38:17 arsvr1 pengine: [5253]: notice: clone_print: Master/Slave Set: ms_drbd_webfs
Feb 10 16:38:17 arsvr1 pengine: [5253]: notice: short_print: Stopped: [ drbd_webfs:0 drbd_webfs:1 ]
Feb 10 16:38:17 arsvr1 pengine: [5253]: notice: group_print: Resource Group: WebServices
Feb 10 16:38:17 arsvr1 pengine: [5253]: notice: native_print: ip1#011(ocf::heartbeat:IPaddr2):#011Stopped
Feb 10 16:38:17 arsvr1 pengine: [5253]: notice: native_print: ip1arp#011(ocf::heartbeat:SendArp):#011Stopped
Feb 10 16:38:17 arsvr1 pengine: [5253]: notice: native_print: fs_webfs#011(ocf::heartbeat:Filesystem):#011Stopped
Feb 10 16:38:17 arsvr1 pengine: [5253]: notice: native_print: apache2#011(lsb:apache2):#011Stopped
Feb 10 16:38:17 arsvr1 pengine: [5253]: info: native_merge_weights: ms_drbd_mysql: Rolling back scores from fs_mysql
Feb 10 16:38:17 arsvr1 pengine: [5253]: info: native_merge_weights: ms_drbd_mysql: Rolling back scores from fs_mysql
Feb 10 16:38:17 arsvr1 pengine: [5253]: info: native_color: Resource drbd_mysql:1 cannot run anywhere
Feb 10 16:38:17 arsvr1 pengine: [5253]: info: native_merge_weights: ms_drbd_mysql: Rolling back scores from fs_mysql
Feb 10 16:38:17 arsvr1 pengine: [5253]: info: native_merge_weights: ms_drbd_mysql: Rolling back scores from fs_mysql
Feb 10 16:38:17 arsvr1 pengine: [5253]: info: master_color: ms_drbd_mysql: Promoted 0 instances of a possible 1 to master
Feb 10 16:38:17 arsvr1 pengine: [5253]: info: native_merge_weights: ip1arp: Rolling back scores from fs_webfs
Feb 10 16:38:17 arsvr1 pengine: [5253]: info: native_merge_weights: ip1arp: Rolling back scores from ip1
Feb 10 16:38:17 arsvr1 pengine: [5253]: info: native_merge_weights: ip1arp: Rolling back scores from fs_mysql
Feb 10 16:38:17 arsvr1 pengine: [5253]: info: native_merge_weights: ip1arp: Rolling back scores from ip1
Feb 10 16:38:17 arsvr1 pengine: [5253]: info: native_merge_weights: ip1arp: Rolling back scores from ip1
Feb 10 16:38:17 arsvr1 crmd: [5251]: WARN: Managed pengine process 5253 killed by signal 11 [SIGSEGV - Segmentation violation].
Feb 10 16:38:17 arsvr1 crmd: [5251]: ERROR: Managed pengine process 5253 dumped core
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: crmdManagedChildDied: Process pengine:[5253] exited (signal=11, exitcode=0)
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: pe_msg_dispatch: Received HUP from pengine:[5253]
Feb 10 16:38:17 arsvr1 crmd: [5251]: CRIT: pe_connection_destroy: Connection to the Policy Engine failed (pid=5253, uuid=679c316a-ec3c-4344-8b45-47d3e6e73fb0)
Feb 10 16:38:17 arsvr1 attrd: [5119]: info: attrd_ha_callback: flush message from arsvr1
Feb 10 16:38:17 arsvr1 attrd: [5119]: info: attrd_ha_callback: flush message from arsvr1
Feb 10 16:38:17 arsvr1 crmd: [5251]: notice: save_cib_contents: Saved CIB contents after PE crash to /var/lib/pengine/pe-core-679c316a-ec3c-4344-8b45-47d3e6e73fb0.bz2
Feb 10 16:38:17 arsvr1 crmd: [5251]: ERROR: do_log: FSA: Input I_ERROR from save_cib_contents() received in state S_POLICY_ENGINE
Feb 10 16:38:17 arsvr1 ccm: [5115]: info: client (pid=5251) removed from ccm
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=save_cib_contents ]
Feb 10 16:38:17 arsvr1 crmd: [5251]: ERROR: do_recover: Action A_RECOVER (0000000001000000) not supported
Feb 10 16:38:17 arsvr1 crmd: [5251]: WARN: do_election_vote: Not voting in election, we're in state S_RECOVERY
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_dc_release: DC role released
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_te_control: Transitioner is now inactive
Feb 10 16:38:17 arsvr1 cib: [5116]: info: cib_process_readwrite: We are now in R/O mode
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_te_control: Disconnecting STONITH...
Feb 10 16:38:17 arsvr1 heartbeat: [5014]: WARN: Managed /usr/lib/heartbeat/crmd process 5251 exited with return code 2.
Feb 10 16:38:17 arsvr1 cib: [5116]: WARN: send_ipc_message: IPC Channel to 5251 is not connected
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: tengine_stonith_connection_destroy: Fencing daemon disconnected
Feb 10 16:38:17 arsvr1 heartbeat: [5014]: ERROR: Respawning client "/usr/lib/heartbeat/crmd":
Feb 10 16:38:17 arsvr1 cib: [5116]: WARN: send_via_callback_channel: Delivery of reply to client 5251/dffbb159-0075-4af5-9767-eda4efff2658 failed
Feb 10 16:38:17 arsvr1 crmd: [5251]: notice: Not currently connected.
Feb 10 16:38:17 arsvr1 heartbeat: [5014]: info: Starting child client "/usr/lib/heartbeat/crmd" (107,117)
Feb 10 16:38:17 arsvr1 cib: [5116]: WARN: do_local_notify: A-Sync reply to crmd failed: reply failed
Feb 10 16:38:17 arsvr1 crmd: [5251]: ERROR: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Feb 10 16:38:17 arsvr1 heartbeat: [5254]: info: Starting "/usr/lib/heartbeat/crmd" as uid 107 gid 117 (pid 5254)
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_state_transition: State transition S_RECOVERY -> S_TERMINATE [ input=I_TERMINATE cause=C_FSA_INTERNAL origin=do_recover ]
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_shutdown: All subsystems stopped, continuing
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_lrm_control: Disconnected from the LRM
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_ha_control: Disconnected from Heartbeat
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_cib_control: Disconnecting CIB
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: crmd_cib_connection_destroy: Connection to the CIB terminated...
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_exit: Performing A_EXIT_0 - gracefully exiting the CRMd
Feb 10 16:38:17 arsvr1 crmd: [5251]: ERROR: do_exit: Could not recover from internal error
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: free_mem: Dropping I_PENDING: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_election_vote ]
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: free_mem: Dropping I_RELEASE_SUCCESS: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_dc_release ]
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: free_mem: Dropping I_TERMINATE: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_stop ]
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_exit: [crmd] stopped (2)
Feb 10 16:38:17 arsvr1 crmd: [5254]: info: Invoked: /usr/lib/heartbeat/crmd
Feb 10 16:38:17 arsvr1 crmd: [5254]: info: main: CRM Hg Version: 042548a451fce8400660f6031f4da6f0223dd5dd
Feb 10 16:38:17 arsvr1 crmd: [5254]: info: crmd_init: Starting crmd
Feb 10 16:38:17 arsvr1 crmd: [5254]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Feb 10 16:38:17 arsvr1 crmd: [5254]: info: do_cib_control: CIB connection established
Feb 10 16:38:17 arsvr1 crmd: [5254]: info: crm_cluster_connect: Connecting to Heartbeat
Feb 10 16:38:18 arsvr1 crmd: [5254]: info: register_heartbeat_conn: Hostname: arsvr1
Feb 10 16:38:18 arsvr1 crmd: [5254]: info: register_heartbeat_conn: UUID: bf0e7394-9684-42b9-893b-5a9a6ecddd7e
Feb 10 16:38:18 arsvr1 crmd: [5254]: info: do_ha_control: Connected to the cluster
Feb 10 16:38:18 arsvr1 crmd: [5254]: info: do_ccm_control: CCM connection established... waiting for first callback
Feb 10 16:38:18 arsvr1 crmd: [5254]: info: do_started: Delaying start, CCM (0000000000100000) not connected
Feb 10 16:38:18 arsvr1 crmd: [5254]: info: crmd_init: Starting crmd's mainloop
Feb 10 16:38:18 arsvr1 crmd: [5254]: info: config_query_callback: Checking for expired actions every 900000ms
Feb 10 16:38:18 arsvr1 crmd: [5254]: notice: crmd_client_status_callback: Status update: Client arsvr1/crmd now has status [online] (DC=false)
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: crm_new_peer: Node 0 is now known as arsvr1
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: crm_update_peer_proc: arsvr1.crmd is now online
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: crmd_client_status_callback: Not the DC
Feb 10 16:38:19 arsvr1 crmd: [5254]: notice: crmd_client_status_callback: Status update: Client arsvr1/crmd now has status [online] (DC=false)
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: crmd_client_status_callback: Not the DC
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: mem_handle_event: instance=1, nodes=1, new=1, lost=0, n_idx=0, new_idx=0, old_idx=3
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: crmd_ccm_msg_callback: Quorum (re)attained after event=NEW MEMBERSHIP (id=1)
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: ccm_event_detail: NEW MEMBERSHIP: trans=1, nodes=1, new=1, lost=0 n_idx=0, new_idx=0, old_idx=3
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: ccm_event_detail: #011CURRENT: arsvr1 [nodeid=0, born=1]
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: ccm_event_detail: #011NEW: arsvr1 [nodeid=0, born=1]
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: crm_update_peer: Node arsvr1: id=0 state=member (new) addr=(null) votes=-1 born=1 seen=1 proc=00000000000000000000000000000200
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: crm_update_peer_proc: arsvr1.ais is now online
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: do_started: The local CRM is operational
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: do_state_transition: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]

Liang Ma
Contractuel | Consultant | SED Systems Inc.
Ground Systems Analyst
Agence spatiale canadienne | Canadian Space Agency
6767, Route de l'Aéroport, Longueuil (St-Hubert), QC, Canada, J3Y 8Y9
Tél/Tel : (450) 926-5099 | Téléc/Fax: (450) 926-5083
Courriel/E-mail : [liang...@space.gc.ca]
Site web/Web site : [www.space.gc.ca]

-----Original Message-----
From: Ma, Liang
Sent: February 10, 2011 9:08 AM
To: The Pacemaker cluster resource manager
Subject: RE: [Pacemaker] Could not connect to the CIB: Remote node did not respond

Thanks Andrew. Yes, cibadmin -Ql works, but cibadmin -Q does not. What is the DC? And here are the logs.

Feb 10 08:57:30 arsvr1 cibadmin: [4264]: info: Invoked: cibadmin -Ql
Feb 10 08:57:32 arsvr1 cibadmin: [4265]: info: Invoked: cibadmin -Q
Feb 10 08:58:04 arsvr1 crmd: [960]: info: do_state_transition: State transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
Feb 10 08:58:04 arsvr1 crmd: [960]: info: do_dc_release: DC role released
Feb 10 08:58:04 arsvr1 crmd: [960]: info: do_te_control: Transitioner is now inactive
Feb 10 08:58:08 arsvr1 crmd: [960]: info: update_dc: Set DC to arsvr2 (3.0.1)
Feb 10 08:58:10 arsvr1 attrd: [959]: info: attrd_local_callback: Sending full refresh (origin=crmd)
Feb 10 08:58:10 arsvr1 crmd: [960]: info: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
Feb 10 08:58:10 arsvr1 attrd: [959]: info: attrd_trigger_update: Sending flush op to all hosts for: shutdown (<null>)
Feb 10 08:58:10 arsvr1 attrd: [959]: info: attrd_trigger_update: Sending flush op to all hosts for: master-drbd_mysql:0 (<null>)
Feb 10 08:58:10 arsvr1 attrd: [959]: info: attrd_trigger_update: Sending flush op to all hosts for: terminate (<null>)
Feb 10 08:58:10 arsvr1 attrd: [959]: info: attrd_trigger_update: Sending flush op to all hosts for: master-drbd_webfs:0 (<null>)
Feb 10 08:58:10 arsvr1 attrd: [959]: info: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (<null>)
Feb 10 08:58:10 arsvr1 attrd: [959]: info: attrd_ha_callback: flush message from arsvr1
Feb 10 08:58:12 arsvr1 attrd: last message repeated 4 times
Feb 10 08:58:12 arsvr1 attrd: [959]: info: attrd_ha_callback: flush message from arsvr2
Feb 10 08:58:12 arsvr1 attrd: [959]: info: attrd_ha_callback: flush message from arsvr2
Feb 10 08:58:12 arsvr1 crmd: [960]: notice: crmd_client_status_callback: Status update: Client arsvr2/crmd now has status [offline] (DC=false)
Feb 10 08:58:12 arsvr1 attrd: [959]: info: attrd_ha_callback: flush message from arsvr2
Feb 10 08:58:12 arsvr1 crmd: [960]: info: crm_update_peer_proc: arsvr2.crmd is now offline
Feb 10 08:58:12 arsvr1 attrd: [959]: info: attrd_ha_callback: flush message from arsvr2
Feb 10 08:58:12 arsvr1 crmd: [960]: info: crmd_client_status_callback: Got client status callback - our DC is dead
Feb 10 08:58:12 arsvr1 crmd: [960]: notice: crmd_client_status_callback: Status update: Client arsvr2/crmd now has status [online] (DC=false)
Feb 10 08:58:12 arsvr1 crmd: [960]: info: crm_update_peer_proc: arsvr2.crmd is now online
Feb 10 08:58:12 arsvr1 crmd: [960]: info: crmd_client_status_callback: Not the DC
Feb 10 08:58:12 arsvr1 crmd: [960]: info: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=crmd_client_status_callback ]
Feb 10 08:58:12 arsvr1 crmd: [960]: info: update_dc: Unset DC arsvr2
Feb 10 08:58:12 arsvr1 attrd: [959]: info: attrd_ha_callback: flush message from arsvr2
Feb 10 08:58:14 arsvr1 heartbeat: [898]: WARN: 1 lost packet(s) for [arsvr2] [131787:131789]
Feb 10 08:58:14 arsvr1 heartbeat: [898]: info: No pkts missing from arsvr2!

-----Original Message-----
From: Andrew Beekhof [mailto:and...@beekhof.net]
Sent: February 10, 2011 2:39 AM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Could not connect to the CIB: Remote node did not respond

On Wed, Feb 9, 2011 at 3:59 PM, <liang...@asc-csa.gc.ca> wrote:
> Hi There,
>
> After a network and power shutdown, my LAMP cluster servers were totally
> screwed up.
>
> Now crm status gives me
>
> crm status
> ============
> Last updated: Wed Feb 9 09:44:17 2011
> Stack: Heartbeat
> Current DC: arsvr2 (bc6bf61d-6b5f-4307-85f3-bf7bb11531bb) - partition with
> quorum
> Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
> 2 Nodes configured, 1 expected votes
> 4 Resources configured.
> ============
>
> Online: [ arsvr1 arsvr2 ]
>
> None of the resources comes up.
>
> First I found a split-brain on the drbd disks. I fixed that and the drbd
> disks are healthy; I can mount them manually without problem.
>
> However, if I try anything to bring up a resource, edit the cib, or even
> run a query, it gives me errors like the following:
>
> crm resource start fs_mysql
> Call cib_replace failed (-41): Remote node did not respond <null>
>
> crm configure edit
> Could not connect to the CIB: Remote node did not respond
> ERROR: creating tmp shadow __crmshell.2540 failed
>
> cibadmin -Q
> Call cib_query failed (-41): Remote node did not respond <null>
>
> Any idea what I can do to bring the cluster back?

Seems like you don't have a DC. Hard to say why without logs.
Does cibadmin -Ql work?
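For anyone triaging a similar failure: the key evidence in the logs above is the pengine SIGSEGV and the CIB snapshot that crmd saves right after the crash. A minimal sketch for pulling both out of syslog follows; the /var/log/messages path is an assumption, adjust it for your syslog setup:

```shell
#!/bin/sh
# Sketch: extract policy-engine crash evidence from a heartbeat/pacemaker
# syslog. Assumes logs land in /var/log/messages (override via $1).
LOG="${1:-/var/log/messages}"

# Show the crash itself: the pengine signal/core-dump lines and the
# crmd exit that follows.
grep -E 'pengine process [0-9]+ (killed by signal|dumped core)|crmd process [0-9]+ exited with return code' "$LOG"

# Recover the path of the CIB snapshot saved after the PE crash; this
# .bz2 is what you would replay offline or attach to a bug report.
sed -n 's/.*Saved CIB contents after PE crash to \(.*\.bz2\).*/\1/p' "$LOG"
```

Against the logs quoted above, the sed command prints /var/lib/pengine/pe-core-679c316a-ec3c-4344-8b45-47d3e6e73fb0.bz2.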
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker