Hello,

I'm working on a two-node cluster. The nodes are r1nren (r1) and r2nren (r2). Some resources are configured at the moment, but I don't think they matter for this problem.
Both nodes are virtual servers running on VMware. Both run Debian stretch, and I'm using corosync and pacemaker for the cluster. The complete list of versions used is below:

root@r2nren:~# uname -a
Linux r2nren.et.cesnet.cz 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u3 (2017-08-06) x86_64 GNU/Linux
root@r2nren:~# dpkg -l | grep corosync
ii  corosync                   2.4.2-3   amd64  cluster engine daemon and utilities
ii  libcorosync-common4:amd64  2.4.2-3   amd64  cluster engine common library
root@r2nren:~# dpkg -l | grep pacemaker
ii  crmsh                      2.3.2-4   all    CRM shell for the pacemaker cluster manager
ii  pacemaker                  1.1.16-1  amd64  cluster resource manager
ii  pacemaker-cli-utils        1.1.16-1  amd64  cluster resource manager command line utilities
ii  pacemaker-common           1.1.16-1  all    cluster resource manager common files
ii  pacemaker-resource-agents  1.1.16-1  all    cluster resource manager general resource agents

root@r1nren:~# uname -a
Linux r1nren.et.cesnet.cz 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u3 (2017-08-06) x86_64 GNU/Linux
root@r1nren:~# dpkg -l | grep corosync
ii  corosync                   2.4.2-3   amd64  cluster engine daemon and utilities
ii  libcorosync-common4:amd64  2.4.2-3   amd64  cluster engine common library
root@r1nren:~# dpkg -l | grep pacemaker
ii  crmsh                      2.3.2-4   all    CRM shell for the pacemaker cluster manager
ii  pacemaker                  1.1.16-1  amd64  cluster resource manager
ii  pacemaker-cli-utils        1.1.16-1  amd64  cluster resource manager command line utilities
ii  pacemaker-common           1.1.16-1  all    cluster resource manager common files
ii  pacemaker-resource-agents  1.1.16-1  all    cluster resource manager general resource agents

When the cluster is operating fine, the state is:

root@r2nren:~# crm status
Stack: corosync
Current DC: r1nren.et.cesnet.cz (version 1.1.16-94ff4df) - partition with quorum
Last updated: Tue Oct 10 10:12:22 2017
Last change: Mon Oct 9 13:09:59 2017 by root via crm_attribute on r1nren.et.cesnet.cz

2 nodes configured
8 resources configured

Online: [ r1nren.et.cesnet.cz r2nren.et.cesnet.cz ]

Full list of resources:

 Clone Set: clone_ping_gw [ping_gw]
     Started: [ r1nren.et.cesnet.cz r2nren.et.cesnet.cz ]
 Resource Group: group_eduroam.cz
     standby_ip (ocf::heartbeat:IPaddr2): Started r1nren.et.cesnet.cz
     offline_file (systemd:offline_file): Started r1nren.et.cesnet.cz
     racoon (systemd:racoon): Started r1nren.et.cesnet.cz
     radiator (systemd:radiator): Started r1nren.et.cesnet.cz
     eduroam_ping (systemd:eduroam_ping): Started r1nren.et.cesnet.cz
     mailto (ocf::heartbeat:MailTo): Started r1nren.et.cesnet.cz

I've discovered that if I reboot either node with the 'reboot' command from a terminal, or reboot it from the VMware web interface, everything works fine. The node being rebooted disconnects from the cluster and reconnects again.
The problem appears when I shut down the machine from the VMware web interface (a guest OS shutdown, not a forced shutdown) and start it again. The machine is then unable to rejoin the cluster. Pacemaker and corosync are not running. Pacemaker reports that it failed on a dependency, which is obviously corosync.
Corosync says:

root@r1nren:~# crm status
ERROR: status: crm_mon (rc=107): Connection to cluster failed: Transport endpoint is not connected
root@r1nren:~# service corosync status
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: failed (Result: signal) since Tue 2017-10-10 10:27:10 CEST; 1min 10s ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
  Process: 709 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=killed, signal=ABRT)
 Main PID: 709 (code=killed, signal=ABRT)

Oct 10 10:27:05 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:06 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:07 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:08 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:09 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:10 r1nren.et.cesnet.cz corosync[709]: corosync: votequorum.c:2065: message_handler_req_exec_votequorum_nodeinfo: Assertion `sender_node != NULL' failed.
Oct 10 10:27:10 r1nren.et.cesnet.cz systemd[1]: corosync.service: Main process exited, code=killed, status=6/ABRT
Oct 10 10:27:10 r1nren.et.cesnet.cz systemd[1]: Failed to start Corosync Cluster Engine.
Oct 10 10:27:10 r1nren.et.cesnet.cz systemd[1]: corosync.service: Unit entered failed state.
Oct 10 10:27:10 r1nren.et.cesnet.cz systemd[1]: corosync.service: Failed with result 'signal'.

root@r1nren:~# journalctl -u corosync
Oct 10 10:26:58 r1nren.et.cesnet.cz systemd[1]: Starting Corosync Cluster Engine...
Oct 10 10:26:58 r1nren.et.cesnet.cz corosync[709]: [MAIN ] Corosync Cluster Engine ('2.4.2'): started and ready to provide service.
Oct 10 10:26:58 r1nren.et.cesnet.cz corosync[709]: [MAIN ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snm
Oct 10 10:26:58 r1nren.et.cesnet.cz corosync[709]: [TOTEM ] Initializing transport (UDP/IP Unicast).
Oct 10 10:26:58 r1nren.et.cesnet.cz corosync[709]: [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha256
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [TOTEM ] The network interface is down.
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [SERV ] Service engine loaded: corosync configuration map access [0]
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [QB ] server name: cmap
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [SERV ] Service engine loaded: corosync configuration service [1]
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [QB ] server name: cfg
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [QB ] server name: cpg
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [SERV ] Service engine loaded: corosync profile loading service [4]
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [SERV ] Service engine loaded: corosync resource monitoring service [6]
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [WD ] No Watchdog /dev/watchdog, try modprobe <a watchdog>
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [WD ] resource load_15min missing a recovery key.
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [WD ] resource memory_used missing a recovery key.
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [WD ] no resources configured.
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [SERV ] Service engine loaded: corosync watchdog service [7]
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [QUORUM] Using quorum provider corosync_votequorum
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [QB ] server name: votequorum
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [QB ] server name: quorum
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [TOTEM ] adding new UDPU member {78.128.211.51}
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [TOTEM ] adding new UDPU member {78.128.211.52}
Oct 10 10:27:00 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-16)
Oct 10 10:27:01 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-16)
Oct 10 10:27:02 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-16)
Oct 10 10:27:03 r1nren.et.cesnet.cz corosync[709]: [TOTEM ] The network interface [78.128.211.51] is now up.
Oct 10 10:27:03 r1nren.et.cesnet.cz corosync[709]: [TOTEM ] adding new UDPU member {78.128.211.51}
Oct 10 10:27:03 r1nren.et.cesnet.cz corosync[709]: [TOTEM ] adding new UDPU member {78.128.211.52}
Oct 10 10:27:03 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:04 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:05 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:06 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:07 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:08 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:09 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:10 r1nren.et.cesnet.cz corosync[709]: corosync: votequorum.c:2065: message_handler_req_exec_votequorum_nodeinfo: Assertion `sender_node != NULL' failed.
Oct 10 10:27:10 r1nren.et.cesnet.cz systemd[1]: corosync.service: Main process exited, code=killed, status=6/ABRT
Oct 10 10:27:10 r1nren.et.cesnet.cz systemd[1]: Failed to start Corosync Cluster Engine.
Oct 10 10:27:10 r1nren.et.cesnet.cz systemd[1]: corosync.service: Unit entered failed state.
Oct 10 10:27:10 r1nren.et.cesnet.cz systemd[1]: corosync.service: Failed with result 'signal'.

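
To make the order of events in the journal above easier to see, here is a minimal Python sketch that pulls out only the lines about service start, interface state, and the assertion. The event labels and the function name `timeline` are my own, chosen for illustration, and the sample lines are copied from the journal above. It only shows that corosync started while the interface was still reported as down; whether that is related to the assertion is an assumption, not something the log proves.

```python
import re

# Sample lines copied from the journal above. In practice, the output of
# `journalctl -u corosync` could be fed in instead.
JOURNAL = """\
Oct 10 10:26:58 r1nren.et.cesnet.cz systemd[1]: Starting Corosync Cluster Engine...
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [TOTEM ] The network interface is down.
Oct 10 10:27:03 r1nren.et.cesnet.cz corosync[709]: [TOTEM ] The network interface [78.128.211.51] is now up.
Oct 10 10:27:10 r1nren.et.cesnet.cz corosync[709]: corosync: votequorum.c:2065: message_handler_req_exec_votequorum_nodeinfo: Assertion `sender_node != NULL' failed.
"""

# Patterns for the events of interest. The labels are hypothetical names
# used only in this sketch; they are not corosync terminology.
EVENTS = [
    ("start", re.compile(r"Starting Corosync Cluster Engine")),
    ("iface_down", re.compile(r"The network interface is down")),
    ("iface_up", re.compile(r"The network interface \[[\d.]+\] is now up")),
    ("assert", re.compile(r"Assertion .* failed")),
]

# Matches the "Oct 10 10:26:58 " prefix and captures the time of day.
TIME_RE = re.compile(r"^\w{3}\s+\d+\s+(\d{2}:\d{2}:\d{2})\s")


def timeline(text):
    """Return (time, label) pairs for each matching journal line, in order."""
    result = []
    for line in text.splitlines():
        match = TIME_RE.match(line)
        if not match:
            continue  # skip lines without a syslog-style timestamp
        for label, pattern in EVENTS:
            if pattern.search(line):
                result.append((match.group(1), label))
                break
    return result


if __name__ == "__main__":
    for when, label in timeline(JOURNAL):
        print(when, label)
```

On the sample lines this lists the start, the "interface is down" message, the "interface is now up" message, and the assertion in that order, which matches the full journal.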
corosync configuration:

root@r1nren:~# cat /etc/corosync/corosync.conf
totem {
    version: 2
    transport: udpu
    cluster_name: eduroam.cz
    token: 3000
    token_retransmits_before_loss_const: 10
    clear_node_high_bit: yes
    crypto_cipher: aes256
    crypto_hash: sha256
    interface {
        ringnumber: 0
        bindnetaddr: 78.128.211.51
        ttl: 1
    }
}

logging {
    fileline: off
    to_stderr: no
    to_logfile: no
    to_syslog: yes
    syslog_facility: daemon
    debug: off
    timestamp: on
    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}

quorum {
    provider: corosync_votequorum
    expected_votes: 2
    two_node: 1
}

nodelist {
    node {
        ring0_addr: 78.128.211.51
    }
    node {
        ring0_addr: 78.128.211.52
    }
}

Let me know if I can provide any more information about this (are there any corosync logs?).

View from r2:

root@r2nren:~# crm status
Stack: corosync
Current DC: r2nren.et.cesnet.cz (version 1.1.16-94ff4df) - partition with quorum
Last updated: Tue Oct 10 10:29:45 2017
Last change: Tue Oct 10 10:25:32 2017 by root via crm_attribute on r1nren.et.cesnet.cz

2 nodes configured
8 resources configured

Online: [ r2nren.et.cesnet.cz ]
OFFLINE: [ r1nren.et.cesnet.cz ]

Full list of resources:

 Clone Set: clone_ping_gw [ping_gw]
     Started: [ r2nren.et.cesnet.cz ]
     Stopped: [ r1nren.et.cesnet.cz ]
 Resource Group: group_eduroam.cz
     standby_ip (ocf::heartbeat:IPaddr2): Started r2nren.et.cesnet.cz
     offline_file (systemd:offline_file): Started r2nren.et.cesnet.cz
     racoon (systemd:racoon): Started r2nren.et.cesnet.cz
     radiator (systemd:radiator): Started r2nren.et.cesnet.cz
     eduroam_ping (systemd:eduroam_ping): Started r2nren.et.cesnet.cz
     mailto (ocf::heartbeat:MailTo): Started r2nren.et.cesnet.cz

What could be causing the problem I encountered? Thanks for your help.

Regards,
Vaclav

--
Václav Mach
CESNET, z.s.p.o.
www.cesnet.cz
_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org