Hi,

Next I took one node offline with /etc/init.d/heartbeat stop.
With only node arsvr1 online, heartbeat tries to respawn crmd, but it exits with return code 2. Here are the logs:

Feb 10 16:37:10 arsvr1 crmd: [5251]: info: do_state_transition: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
Feb 10 16:38:11 arsvr1 crmd: [5251]: info: crm_timer_popped: Election Trigger (I_DC_TIMEOUT) just popped!
Feb 10 16:38:11 arsvr1 crmd: [5251]: WARN: do_log: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
Feb 10 16:38:11 arsvr1 crmd: [5251]: info: do_state_transition: State transition S_PENDING -> S_ELECTION [ input=I_DC_TIMEOUT cause=C_TIMER_POPPED origin=crm_timer_popped ]
Feb 10 16:38:11 arsvr1 crmd: [5251]: info: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
Feb 10 16:38:11 arsvr1 crmd: [5251]: info: do_te_control: Registering TE UUID: c173d324-3b4f-445b-850f-f3406cc116ac
Feb 10 16:38:11 arsvr1 crmd: [5251]: WARN: cib_client_add_notify_callback: Callback already present
Feb 10 16:38:11 arsvr1 crmd: [5251]: info: set_graph_functions: Setting custom graph functions
Feb 10 16:38:11 arsvr1 crmd: [5251]: info: unpack_graph: Unpacked transition -1: 0 actions in 0 synapses
Feb 10 16:38:11 arsvr1 crmd: [5251]: info: start_subsystem: Starting sub-system "pengine"
Feb 10 16:38:11 arsvr1 pengine: [5253]: info: Invoked: /usr/lib/heartbeat/pengine
Feb 10 16:38:11 arsvr1 pengine: [5253]: info: main: Starting pengine
Feb 10 16:38:14 arsvr1 crmd: [5251]: info: do_dc_takeover: Taking over DC status for this partition
Feb 10 16:38:14 arsvr1 cib: [5116]: info: cib_process_readwrite: We are now in R/W mode
Feb 10 16:38:14 arsvr1 cib: [5116]: info: cib_process_request: Operation complete: op cib_master for section 'all' (origin=local/crmd/6, version=0.298.3): ok (rc=0)
Feb 10 16:38:14 arsvr1 cib: [5116]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/7, version=0.298.3): ok (rc=0)
Feb 10 16:38:14 arsvr1 cib: [5116]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/9, version=0.298.3): ok (rc=0)
Feb 10 16:38:14 arsvr1 crmd: [5251]: info: join_make_offer: Making join offers based on membership 1
Feb 10 16:38:14 arsvr1 crmd: [5251]: info: do_dc_join_offer_all: join-1: Waiting on 1 outstanding join acks
Feb 10 16:38:14 arsvr1 crmd: [5251]: info: te_connect_stonith: Attempting connection to fencing daemon...
Feb 10 16:38:14 arsvr1 cib: [5116]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/11, version=0.298.3): ok (rc=0)
Feb 10 16:38:15 arsvr1 crmd: [5251]: info: te_connect_stonith: Connected
Feb 10 16:38:15 arsvr1 crmd: [5251]: info: config_query_callback: Checking for expired actions every 900000ms
Feb 10 16:38:15 arsvr1 crmd: [5251]: info: update_dc: Set DC to arsvr1 (3.0.1)
Feb 10 16:38:16 arsvr1 crmd: [5251]: info: do_state_transition: State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
Feb 10 16:38:16 arsvr1 crmd: [5251]: info: do_state_transition: All 1 cluster nodes responded to the join offer.
Feb 10 16:38:16 arsvr1 crmd: [5251]: info: do_dc_join_finalize: join-1: Syncing the CIB from arsvr1 to the rest of the cluster
Feb 10 16:38:16 arsvr1 cib: [5116]: info: cib_process_request: Operation complete: op cib_sync for section 'all' (origin=local/crmd/14, version=0.298.3): ok (rc=0)
Feb 10 16:38:16 arsvr1 cib: [5116]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/15, version=0.298.3): ok (rc=0)
Feb 10 16:38:16 arsvr1 crmd: [5251]: info: update_attrd: Connecting to attrd...
Feb 10 16:38:16 arsvr1 cib: [5116]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='arsvr1']/transient_attributes (origin=local/crmd/16, version=0.298.3): ok (rc=0)
Feb 10 16:38:16 arsvr1 crmd: [5251]: info: erase_xpath_callback: Deletion of "//node_state[@uname='arsvr1']/transient_attributes": ok (rc=0)
Feb 10 16:38:16 arsvr1 crmd: [5251]: info: do_dc_join_ack: join-1: Updating node state to member for arsvr1
Feb 10 16:38:16 arsvr1 cib: [5116]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='arsvr1']/lrm (origin=local/crmd/17, version=0.298.4): ok (rc=0)
Feb 10 16:38:16 arsvr1 crmd: [5251]: info: erase_xpath_callback: Deletion of "//node_state[@uname='arsvr1']/lrm": ok (rc=0)
Feb 10 16:38:16 arsvr1 crmd: [5251]: info: do_state_transition: State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [ input=I_FINALIZED cause=C_FSA_INTERNAL origin=check_join_state ]
Feb 10 16:38:16 arsvr1 crmd: [5251]: info: populate_cib_nodes_ha: Requesting the list of configured nodes
Feb 10 16:38:17 arsvr1 crmd: [5251]: WARN: get_uuid: Could not calculate UUID for arsvr2
Feb 10 16:38:17 arsvr1 crmd: [5251]: WARN: populate_cib_nodes_ha: Node arsvr2: no uuid found
Feb 10 16:38:17 arsvr1 attrd: [5119]: info: attrd_local_callback: Sending full refresh (origin=crmd)
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_state_transition: All 1 cluster nodes are eligible to run resources.
Feb 10 16:38:17 arsvr1 attrd: [5119]: info: attrd_trigger_update: Sending flush op to all hosts for: shutdown (<null>)
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_dc_join_final: Ensuring DC, quorum and node attributes are up-to-date
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: crm_update_quorum: Updating quorum status to true (call=21)
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: abort_transition_graph: do_te_invoke:191 - Triggered transition abort (complete=1) : Peer Cancelled
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_pe_invoke: Query 22: Requesting the current CIB: S_POLICY_ENGINE
Feb 10 16:38:17 arsvr1 cib: [5116]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/19, version=0.298.5): ok (rc=0)
Feb 10 16:38:17 arsvr1 cib: [5116]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/21, version=0.298.5): ok (rc=0)
Feb 10 16:38:17 arsvr1 attrd: [5119]: info: attrd_trigger_update: Sending flush op to all hosts for: terminate (<null>)
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_pe_invoke_callback: Invoking the PE: query=22, ref=pe_calc-dc-1297373897-7, seq=1, quorate=1
Feb 10 16:38:17 arsvr1 pengine: [5253]: notice: unpack_config: On loss of CCM Quorum: Ignore
Feb 10 16:38:17 arsvr1 pengine: [5253]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Feb 10 16:38:17 arsvr1 pengine: [5253]: info: determine_online_status: Node arsvr1 is online
Feb 10 16:38:17 arsvr1 pengine: [5253]: notice: group_print: Resource Group: MySQLDB
Feb 10 16:38:17 arsvr1 pengine: [5253]: notice: native_print: fs_mysql#011(ocf::heartbeat:Filesystem):#011Stopped
Feb 10 16:38:17 arsvr1 pengine: [5253]: notice: native_print: mysql#011(ocf::heartbeat:mysql):#011Stopped
Feb 10 16:38:17 arsvr1 pengine: [5253]: notice: clone_print: Master/Slave Set: ms_drbd_mysql
Feb 10 16:38:17 arsvr1 pengine: [5253]: notice: short_print: Stopped: [ drbd_mysql:0 drbd_mysql:1 ]
Feb 10 16:38:17 arsvr1 pengine: [5253]: notice: clone_print: Master/Slave Set: ms_drbd_webfs
Feb 10 16:38:17 arsvr1 pengine: [5253]: notice: short_print: Stopped: [ drbd_webfs:0 drbd_webfs:1 ]
Feb 10 16:38:17 arsvr1 pengine: [5253]: notice: group_print: Resource Group: WebServices
Feb 10 16:38:17 arsvr1 pengine: [5253]: notice: native_print: ip1#011(ocf::heartbeat:IPaddr2):#011Stopped
Feb 10 16:38:17 arsvr1 pengine: [5253]: notice: native_print: ip1arp#011(ocf::heartbeat:SendArp):#011Stopped
Feb 10 16:38:17 arsvr1 pengine: [5253]: notice: native_print: fs_webfs#011(ocf::heartbeat:Filesystem):#011Stopped
Feb 10 16:38:17 arsvr1 pengine: [5253]: notice: native_print: apache2#011(lsb:apache2):#011Stopped
Feb 10 16:38:17 arsvr1 pengine: [5253]: info: native_merge_weights: ms_drbd_mysql: Rolling back scores from fs_mysql
Feb 10 16:38:17 arsvr1 pengine: [5253]: info: native_merge_weights: ms_drbd_mysql: Rolling back scores from fs_mysql
Feb 10 16:38:17 arsvr1 pengine: [5253]: info: native_color: Resource drbd_mysql:1 cannot run anywhere
Feb 10 16:38:17 arsvr1 pengine: [5253]: info: native_merge_weights: ms_drbd_mysql: Rolling back scores from fs_mysql
Feb 10 16:38:17 arsvr1 pengine: [5253]: info: native_merge_weights: ms_drbd_mysql: Rolling back scores from fs_mysql
Feb 10 16:38:17 arsvr1 pengine: [5253]: info: master_color: ms_drbd_mysql: Promoted 0 instances of a possible 1 to master
Feb 10 16:38:17 arsvr1 pengine: [5253]: info: native_merge_weights: ip1arp: Rolling back scores from fs_webfs
Feb 10 16:38:17 arsvr1 pengine: [5253]: info: native_merge_weights: ip1arp: Rolling back scores from ip1
Feb 10 16:38:17 arsvr1 pengine: [5253]: info: native_merge_weights: ip1arp: Rolling back scores from fs_mysql
Feb 10 16:38:17 arsvr1 pengine: [5253]: info: native_merge_weights: ip1arp: Rolling back scores from ip1
Feb 10 16:38:17 arsvr1 pengine: [5253]: info: native_merge_weights: ip1arp: Rolling back scores from ip1
Feb 10 16:38:17 arsvr1 crmd: [5251]: WARN: Managed pengine process 5253 killed by signal 11 [SIGSEGV - Segmentation violation].
Feb 10 16:38:17 arsvr1 crmd: [5251]: ERROR: Managed pengine process 5253 dumped core
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: crmdManagedChildDied: Process pengine:[5253] exited (signal=11, exitcode=0)
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: pe_msg_dispatch: Received HUP from pengine:[5253]
Feb 10 16:38:17 arsvr1 crmd: [5251]: CRIT: pe_connection_destroy: Connection to the Policy Engine failed (pid=5253, uuid=679c316a-ec3c-4344-8b45-47d3e6e73fb0)
Feb 10 16:38:17 arsvr1 attrd: [5119]: info: attrd_ha_callback: flush message from arsvr1
Feb 10 16:38:17 arsvr1 attrd: [5119]: info: attrd_ha_callback: flush message from arsvr1
Feb 10 16:38:17 arsvr1 crmd: [5251]: notice: save_cib_contents: Saved CIB contents after PE crash to /var/lib/pengine/pe-core-679c316a-ec3c-4344-8b45-47d3e6e73fb0.bz2
Feb 10 16:38:17 arsvr1 crmd: [5251]: ERROR: do_log: FSA: Input I_ERROR from save_cib_contents() received in state S_POLICY_ENGINE
Feb 10 16:38:17 arsvr1 ccm: [5115]: info: client (pid=5251) removed from ccm
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=save_cib_contents ]
Feb 10 16:38:17 arsvr1 crmd: [5251]: ERROR: do_recover: Action A_RECOVER (0000000001000000) not supported
Feb 10 16:38:17 arsvr1 crmd: [5251]: WARN: do_election_vote: Not voting in election, we're in state S_RECOVERY
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_dc_release: DC role released
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_te_control: Transitioner is now inactive
Feb 10 16:38:17 arsvr1 cib: [5116]: info: cib_process_readwrite: We are now in R/O mode
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_te_control: Disconnecting STONITH...
Feb 10 16:38:17 arsvr1 heartbeat: [5014]: WARN: Managed /usr/lib/heartbeat/crmd process 5251 exited with return code 2.
Feb 10 16:38:17 arsvr1 cib: [5116]: WARN: send_ipc_message: IPC Channel to 5251 is not connected
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: tengine_stonith_connection_destroy: Fencing daemon disconnected
Feb 10 16:38:17 arsvr1 heartbeat: [5014]: ERROR: Respawning client "/usr/lib/heartbeat/crmd":
Feb 10 16:38:17 arsvr1 cib: [5116]: WARN: send_via_callback_channel: Delivery of reply to client 5251/dffbb159-0075-4af5-9767-eda4efff2658 failed
Feb 10 16:38:17 arsvr1 crmd: [5251]: notice: Not currently connected.
Feb 10 16:38:17 arsvr1 heartbeat: [5014]: info: Starting child client "/usr/lib/heartbeat/crmd" (107,117)
Feb 10 16:38:17 arsvr1 cib: [5116]: WARN: do_local_notify: A-Sync reply to crmd failed: reply failed
Feb 10 16:38:17 arsvr1 crmd: [5251]: ERROR: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Feb 10 16:38:17 arsvr1 heartbeat: [5254]: info: Starting "/usr/lib/heartbeat/crmd" as uid 107 gid 117 (pid 5254)
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_state_transition: State transition S_RECOVERY -> S_TERMINATE [ input=I_TERMINATE cause=C_FSA_INTERNAL origin=do_recover ]
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_shutdown: All subsystems stopped, continuing
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_lrm_control: Disconnected from the LRM
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_ha_control: Disconnected from Heartbeat
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_cib_control: Disconnecting CIB
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: crmd_cib_connection_destroy: Connection to the CIB terminated...
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_exit: Performing A_EXIT_0 - gracefully exiting the CRMd
Feb 10 16:38:17 arsvr1 crmd: [5251]: ERROR: do_exit: Could not recover from internal error
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: free_mem: Dropping I_PENDING: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_election_vote ]
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: free_mem: Dropping I_RELEASE_SUCCESS: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_dc_release ]
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: free_mem: Dropping I_TERMINATE: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_stop ]
Feb 10 16:38:17 arsvr1 crmd: [5251]: info: do_exit: [crmd] stopped (2)
Feb 10 16:38:17 arsvr1 crmd: [5254]: info: Invoked: /usr/lib/heartbeat/crmd
Feb 10 16:38:17 arsvr1 crmd: [5254]: info: main: CRM Hg Version: 042548a451fce8400660f6031f4da6f0223dd5dd
Feb 10 16:38:17 arsvr1 crmd: [5254]: info: crmd_init: Starting crmd
Feb 10 16:38:17 arsvr1 crmd: [5254]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Feb 10 16:38:17 arsvr1 crmd: [5254]: info: do_cib_control: CIB connection established
Feb 10 16:38:17 arsvr1 crmd: [5254]: info: crm_cluster_connect: Connecting to Heartbeat
Feb 10 16:38:18 arsvr1 crmd: [5254]: info: register_heartbeat_conn: Hostname: arsvr1
Feb 10 16:38:18 arsvr1 crmd: [5254]: info: register_heartbeat_conn: UUID: bf0e7394-9684-42b9-893b-5a9a6ecddd7e
Feb 10 16:38:18 arsvr1 crmd: [5254]: info: do_ha_control: Connected to the cluster
Feb 10 16:38:18 arsvr1 crmd: [5254]: info: do_ccm_control: CCM connection established... waiting for first callback
Feb 10 16:38:18 arsvr1 crmd: [5254]: info: do_started: Delaying start, CCM (0000000000100000) not connected
Feb 10 16:38:18 arsvr1 crmd: [5254]: info: crmd_init: Starting crmd's mainloop
Feb 10 16:38:18 arsvr1 crmd: [5254]: info: config_query_callback: Checking for expired actions every 900000ms
Feb 10 16:38:18 arsvr1 crmd: [5254]: notice: crmd_client_status_callback: Status update: Client arsvr1/crmd now has status [online] (DC=false)
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: crm_new_peer: Node 0 is now known as arsvr1
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: crm_update_peer_proc: arsvr1.crmd is now online
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: crmd_client_status_callback: Not the DC
Feb 10 16:38:19 arsvr1 crmd: [5254]: notice: crmd_client_status_callback: Status update: Client arsvr1/crmd now has status [online] (DC=false)
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: crmd_client_status_callback: Not the DC
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: mem_handle_event: instance=1, nodes=1, new=1, lost=0, n_idx=0, new_idx=0, old_idx=3
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: crmd_ccm_msg_callback: Quorum (re)attained after event=NEW MEMBERSHIP (id=1)
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: ccm_event_detail: NEW MEMBERSHIP: trans=1, nodes=1, new=1, lost=0 n_idx=0, new_idx=0, old_idx=3
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: ccm_event_detail: #011CURRENT: arsvr1 [nodeid=0, born=1]
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: ccm_event_detail: #011NEW: arsvr1 [nodeid=0, born=1]
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: crm_update_peer: Node arsvr1: id=0 state=member (new) addr=(null) votes=-1 born=1 seen=1 proc=00000000000000000000000000000200
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: crm_update_peer_proc: arsvr1.ais is now online
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: do_started: The local CRM is operational
Feb 10 16:38:19 arsvr1 crmd: [5254]: info: do_state_transition: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]

Liang Ma
Contractuel | Consultant | SED Systems Inc.
Ground Systems Analyst
Agence spatiale canadienne | Canadian Space Agency
6767, Route de l'Aéroport, Longueuil (St-Hubert), QC, Canada, J3Y 8Y9
Tél/Tel : (450) 926-5099 | Téléc/Fax: (450) 926-5083
Courriel/E-mail : [liang...@space.gc.ca]
Site web/Web site : [www.space.gc.ca]

-----Original Message-----
From: Ma, Liang
Sent: February 10, 2011 9:08 AM
To: The Pacemaker cluster resource manager
Subject: RE: [Pacemaker] Could not connect to the CIB: Remote node did not respond

Thanks Andrew. Yes, cibadmin -Ql works, but cibadmin -Q does not. What is the DC? And here are the logs.

Feb 10 08:57:30 arsvr1 cibadmin: [4264]: info: Invoked: cibadmin -Ql
Feb 10 08:57:32 arsvr1 cibadmin: [4265]: info: Invoked: cibadmin -Q
Feb 10 08:58:04 arsvr1 crmd: [960]: info: do_state_transition: State transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
Feb 10 08:58:04 arsvr1 crmd: [960]: info: do_dc_release: DC role released
Feb 10 08:58:04 arsvr1 crmd: [960]: info: do_te_control: Transitioner is now inactive
Feb 10 08:58:08 arsvr1 crmd: [960]: info: update_dc: Set DC to arsvr2 (3.0.1)
Feb 10 08:58:10 arsvr1 attrd: [959]: info: attrd_local_callback: Sending full refresh (origin=crmd)
Feb 10 08:58:10 arsvr1 crmd: [960]: info: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
Feb 10 08:58:10 arsvr1 attrd: [959]: info: attrd_trigger_update: Sending flush op to all hosts for: shutdown (<null>)
Feb 10 08:58:10 arsvr1 attrd: [959]: info: attrd_trigger_update: Sending flush op to all hosts for: master-drbd_mysql:0 (<null>)
Feb 10 08:58:10 arsvr1 attrd: [959]: info: attrd_trigger_update: Sending flush op to all hosts for: terminate (<null>)
Feb 10 08:58:10 arsvr1 attrd: [959]: info: attrd_trigger_update: Sending flush op to all hosts for: master-drbd_webfs:0 (<null>)
Feb 10 08:58:10 arsvr1 attrd: [959]: info: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (<null>)
Feb 10 08:58:10 arsvr1 attrd: [959]: info: attrd_ha_callback: flush message from arsvr1
Feb 10 08:58:12 arsvr1 attrd: last message repeated 4 times
Feb 10 08:58:12 arsvr1 attrd: [959]: info: attrd_ha_callback: flush message from arsvr2
Feb 10 08:58:12 arsvr1 attrd: [959]: info: attrd_ha_callback: flush message from arsvr2
Feb 10 08:58:12 arsvr1 crmd: [960]: notice: crmd_client_status_callback: Status update: Client arsvr2/crmd now has status [offline] (DC=false)
Feb 10 08:58:12 arsvr1 attrd: [959]: info: attrd_ha_callback: flush message from arsvr2
Feb 10 08:58:12 arsvr1 crmd: [960]: info: crm_update_peer_proc: arsvr2.crmd is now offline
Feb 10 08:58:12 arsvr1 attrd: [959]: info: attrd_ha_callback: flush message from arsvr2
Feb 10 08:58:12 arsvr1 crmd: [960]: info: crmd_client_status_callback: Got client status callback - our DC is dead
Feb 10 08:58:12 arsvr1 crmd: [960]: notice: crmd_client_status_callback: Status update: Client arsvr2/crmd now has status [online] (DC=false)
Feb 10 08:58:12 arsvr1 crmd: [960]: info: crm_update_peer_proc: arsvr2.crmd is now online
Feb 10 08:58:12 arsvr1 crmd: [960]: info: crmd_client_status_callback: Not the DC
Feb 10 08:58:12 arsvr1 crmd: [960]: info: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=crmd_client_status_callback ]
Feb 10 08:58:12 arsvr1 crmd: [960]: info: update_dc: Unset DC arsvr2
Feb 10 08:58:12 arsvr1 attrd: [959]: info: attrd_ha_callback: flush message from arsvr2
Feb 10 08:58:14 arsvr1 heartbeat: [898]: WARN: 1 lost packet(s) for [arsvr2] [131787:131789]
Feb 10 08:58:14 arsvr1 heartbeat: [898]: info: No pkts missing from arsvr2!

-----Original Message-----
From: Andrew Beekhof [mailto:and...@beekhof.net]
Sent: February 10, 2011 2:39 AM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Could not connect to the CIB: Remote node did not respond

On Wed, Feb 9, 2011 at 3:59 PM, <liang...@asc-csa.gc.ca> wrote:
> Hi There,
>
> After a network and power shutdown, my LAMP cluster servers were totally
> screwed up.
>
> Now crm status gives me
>
> crm status
> ============
> Last updated: Wed Feb 9 09:44:17 2011
> Stack: Heartbeat
> Current DC: arsvr2 (bc6bf61d-6b5f-4307-85f3-bf7bb11531bb) - partition with
> quorum
> Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
> 2 Nodes configured, 1 expected votes
> 4 Resources configured.
> ============
>
> Online: [ arsvr1 arsvr2 ]
>
> None of the resources comes up.
>
> First I found a split-brain on the drbd disks. I fixed that and the drbd
> disks are healthy; I can mount them manually without problem.
>
> However, if I try anything to bring up a resource, edit the cib, or even
> run a query, it gives me errors like the following:
>
> crm resource start fs_mysql
> Call cib_replace failed (-41): Remote node did not respond <null>
>
> crm configure edit
> Could not connect to the CIB: Remote node did not respond
> ERROR: creating tmp shadow __crmshell.2540 failed
>
> cibadmin -Q
> Call cib_query failed (-41): Remote node did not respond <null>
>
> Any idea what I can do to bring the cluster back?

Seems like you don't have a DC. Hard to say why without logs.
Does cibadmin -Ql work?
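For anyone triaging a similar failure: the key evidence in the logs above is the pengine SIGSEGV and the CIB snapshot that crmd saves right after the crash. A minimal sketch for pulling both out of syslog follows; the /var/log/messages path is an assumption, adjust it for your syslog setup:

```shell
#!/bin/sh
# Sketch: extract policy-engine crash evidence from a heartbeat/pacemaker
# syslog. Assumes logs land in /var/log/messages (override via $1).
LOG="${1:-/var/log/messages}"

# Show the crash itself: the pengine signal/core-dump lines and the
# crmd exit that follows.
grep -E 'pengine process [0-9]+ (killed by signal|dumped core)|crmd process [0-9]+ exited with return code' "$LOG"

# Recover the path of the CIB snapshot saved after the PE crash; this
# .bz2 is what you would replay offline or attach to a bug report.
sed -n 's/.*Saved CIB contents after PE crash to \(.*\.bz2\).*/\1/p' "$LOG"
```

Against the logs quoted above, the sed command prints /var/lib/pengine/pe-core-679c316a-ec3c-4344-8b45-47d3e6e73fb0.bz2.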
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker