Attaching the corosync.conf:
------------------------------------
compatibility: whitetank

totem {
    version: 2
    token: 10000
    token_retransmits_before_loss_const: 10
    secauth: off
    threads: 0
    interface {
        ringnumber: 0
        member: {
            memberaddr: 10.0.0.1
        }
        member: {
            memberaddr: 10.0.0.2
        }
        bindnetaddr: 10.0.0.1
        mcastport: 5405
        ttl: 1
    }
    transport: udpu
}

logging {
    fileline: off
    to_stderr: no
    to_logfile: yes
    to_syslog: yes
    syslog_facility: local6
    syslog_priority: debug
    debug: on
    logfile: /var/log/cluster/corosync.log
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
    }
}

amf {
    mode: disabled
}

service {
    ver: 1
    name: pacemaker
}

aisexec {
    user: root
    group: root
}
------------------------------------
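One note on that config: with ver: 1 in the service stanza, pacemaker is not spawned by the corosync plugin but by its own init script, so corosync needs to be up and the ring healthy before pacemaker starts. A rough sketch of the startup order on EL6, assuming the stock init scripts shipped with these corosync/pacemaker packages:

------------------------------------
# start corosync first and confirm the ring is up before starting pacemaker
service corosync start
corosync-cfgtool -s     # should report something like "ring 0 active with no faults"
service pacemaker start

# optionally enable both at boot
chkconfig corosync on
chkconfig pacemaker on
------------------------------------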
2014-07-18 10:35 GMT+08:00 Emre He <emre...@gmail.com>:
> Hi,
>
> I am working on a classic corosync+pacemaker Linux-HA cluster (2 servers).
> After rebooting one server, when it comes back up, corosync is running but
> pacemaker is dead.
>
> In corosync.log we can see the following:
> --------------------------------------------------------
> Jul 17 03:56:04 [2068] foo.bar.com crmd: info: crmd_exit: Dropping I_TERMINATE: [ state=S_STOPPING cause=C_FSA_INTERNAL origin=do_stop ]
> Jul 17 03:56:04 [2068] foo.bar.com crmd: debug: lrm_state_verify_stopped: Checking for active resources before exit
> Jul 17 03:56:04 [2068] foo.bar.com crmd: info: crmd_cs_destroy: connection closed
> Jul 17 03:56:04 [2068] foo.bar.com crmd: info: crmd_init: Inhibiting automated respawn
> *Jul 17 03:56:04 [2068] foo.bar.com crmd: info: crmd_init: 2068 stopped: Network is down (100)*
> *Jul 17 03:56:04 [2068] foo.bar.com crmd: warning: crmd_fast_exit: Inhibiting respawn: 100 -> 100*
> Jul 17 03:56:04 [2068] foo.bar.com crmd: info: crm_xml_cleanup: Cleaning up memory from libxml2
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: debug: qb_ipcs_dispatch_connection_request: HUP conn (2057-2068-14)
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: debug: qb_ipcs_disconnect: qb_ipcs_disconnect(2057-2068-14) state:2
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: info: crm_client_destroy: Destroying 0 events
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: debug: qb_rb_close: Free'ing ringbuffer: /dev/shm/qb-pacemakerd-response-2057-2068-14-header
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: debug: qb_rb_close: Free'ing ringbuffer: /dev/shm/qb-pacemakerd-event-2057-2068-14-header
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: debug: qb_rb_close: Free'ing ringbuffer: /dev/shm/qb-pacemakerd-request-2057-2068-14-header
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: error: pcmk_child_exit: Child process crmd (2068) exited: Network is down (100)
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: warning: pcmk_child_exit: Pacemaker child process crmd no longer wishes to be respawned. Shutting ourselves down.
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: debug: update_node_processes: Node foo.bar.com now has process list: 00000000000000000000000000111112 (was 00000000000000000000000000111312)
> *Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: notice: pcmk_shutdown_worker: Shuting down Pacemaker*
> *Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: debug: pcmk_shutdown_worker: crmd confirmed stopped*
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: notice: stop_child: Stopping pengine: Sent -15 to process 2067
> Jul 17 03:56:04 [2067] foo.bar.com pengine: info: crm_signal_dispatch: Invoking handler for signal 15: Terminated
> Jul 17 03:56:04 [2067] foo.bar.com pengine: info: qb_ipcs_us_withdraw: withdrawing server sockets
>
> Jul 17 03:56:04 [2063] foo.bar.com cib: debug: qb_ipcs_unref: qb_ipcs_unref() - destroying
> Jul 17 03:56:04 [2063] foo.bar.com cib: info: crm_xml_cleanup: Cleaning up memory from libxml2
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: info: pcmk_child_exit: Child process cib (2063) exited: OK (0)
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: debug: update_node_processes: Node foo.bar.com now has process list: 00000000000000000000000000000002 (was 00000000000000000000000000000102)
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: warning: qb_ipcs_event_sendv: new_event_notification (2057-2063-13): Broken pipe (32)
> *Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: debug: pcmk_shutdown_worker: cib confirmed stopped*
> *Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: notice: pcmk_shutdown_worker: Shutdown complete*
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: notice: pcmk_shutdown_worker: Attempting to inhibit respawning after fatal error
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: info: crm_xml_cleanup: Cleaning up memory from libxml2
> Jul 17 03:56:04 corosync [CPG ] exit_fn for conn=0x17e3a20
> Jul 17 03:56:04 corosync [pcmk ] WARN: route_ais_message: Sending message to local.stonith-ng failed: ipc delivery failed (rc=-2)
> Jul 17 03:56:04 corosync [CPG ] got procleave message from cluster node 433183754
> Jul 17 03:56:07 corosync [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
> *Jul 17 03:56:19 corosync [pcmk ] WARN: route_ais_message: Sending message to local.stonith-ng failed: ipc delivery failed (rc=-2)*
> *Jul 17 03:56:19 corosync [pcmk ] WARN: route_ais_message: Sending message to local.stonith-ng failed: ipc delivery failed (rc=-2)*
> --------------------------------------------------------
>
> Here are my HA cluster parameters and package versions:
> --------------------------------------------------------
> property cib-bootstrap-options: \
>     dc-version=1.1.10-1.el6_4.4-368c726 \
>     cluster-infrastructure="classic openais (with plugin)" \
>     expected-quorum-votes=2 \
>     stonith-enabled=false \
>     no-quorum-policy=ignore \
>     start-failure-is-fatal=false \
>     default-action-timeout=300s
> rsc_defaults rsc-options: \
>     resource-stickiness=100
>
> pacemaker-1.1.10-1.el6_4.4.x86_64
> corosync-1.4.1-15.el6_4.1.x86_64
> --------------------------------------------------------
>
> I am not sure whether there was a brief network disconnection (both servers
> are VMware VMs), but the logs seem to show one.
> So is a transient network outage the root cause? I would have thought that
> is exactly the kind of thing an HA cluster should handle.
> Or is there any other clue about the root cause?
>
> Many thanks,
> Emre
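Regarding the question in the quoted mail about a brief network drop being the trigger: crmd exiting with "Network is down (100)" right after "crmd_cs_destroy: connection closed" suggests it lost its connection to the cluster layer. A couple of quick checks that may help confirm this, using the standard corosync 1.x / pacemaker command-line tools (the log path is the one from the config above):

------------------------------------
# look for TOTEM faults / membership changes around the reboot time
grep -E 'TOTEM|membership' /var/log/cluster/corosync.log

# ring status on each node
corosync-cfgtool -s

# cluster view once pacemaker is running again
crm_mon -1
------------------------------------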