Hello,

I'm working on a two-node cluster. The nodes are r1nren (r1) and r2nren (r2). There are some resources configured at the moment, but I don't think they matter for this problem.

Both nodes are virtual machines running on VMware, both running Debian Stretch, and the cluster uses Corosync and Pacemaker. A complete list of the versions in use is below:

root@r2nren:~# uname -a
Linux r2nren.et.cesnet.cz 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u3 (2017-08-06) x86_64 GNU/Linux
root@r2nren:~# dpkg -l | grep corosync
ii  corosync                   2.4.2-3   amd64  cluster engine daemon and utilities
ii  libcorosync-common4:amd64  2.4.2-3   amd64  cluster engine common library
root@r2nren:~# dpkg -l | grep pacemaker
ii  crmsh                      2.3.2-4   all    CRM shell for the pacemaker cluster manager
ii  pacemaker                  1.1.16-1  amd64  cluster resource manager
ii  pacemaker-cli-utils        1.1.16-1  amd64  cluster resource manager command line utilities
ii  pacemaker-common           1.1.16-1  all    cluster resource manager common files
ii  pacemaker-resource-agents  1.1.16-1  all    cluster resource manager general resource agents

root@r1nren:~# uname -a
Linux r1nren.et.cesnet.cz 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u3 (2017-08-06) x86_64 GNU/Linux
root@r1nren:~# dpkg -l | grep corosync
ii  corosync                   2.4.2-3   amd64  cluster engine daemon and utilities
ii  libcorosync-common4:amd64  2.4.2-3   amd64  cluster engine common library
root@r1nren:~# dpkg -l | grep pacemaker
ii  crmsh                      2.3.2-4   all    CRM shell for the pacemaker cluster manager
ii  pacemaker                  1.1.16-1  amd64  cluster resource manager
ii  pacemaker-cli-utils        1.1.16-1  amd64  cluster resource manager command line utilities
ii  pacemaker-common           1.1.16-1  all    cluster resource manager common files
ii  pacemaker-resource-agents  1.1.16-1  all    cluster resource manager general resource agents

When the cluster is operating fine, the state is:
root@r2nren:~# crm status
Stack: corosync
Current DC: r1nren.et.cesnet.cz (version 1.1.16-94ff4df) - partition with quorum
Last updated: Tue Oct 10 10:12:22 2017
Last change: Mon Oct 9 13:09:59 2017 by root via crm_attribute on r1nren.et.cesnet.cz

2 nodes configured
8 resources configured

Online: [ r1nren.et.cesnet.cz r2nren.et.cesnet.cz ]

Full list of resources:

 Clone Set: clone_ping_gw [ping_gw]
     Started: [ r1nren.et.cesnet.cz r2nren.et.cesnet.cz ]
 Resource Group: group_eduroam.cz
     standby_ip (ocf::heartbeat:IPaddr2):       Started r1nren.et.cesnet.cz
     offline_file       (systemd:offline_file): Started r1nren.et.cesnet.cz
     racoon     (systemd:racoon):       Started r1nren.et.cesnet.cz
     radiator   (systemd:radiator):     Started r1nren.et.cesnet.cz
     eduroam_ping       (systemd:eduroam_ping): Started r1nren.et.cesnet.cz
     mailto     (ocf::heartbeat:MailTo):        Started r1nren.et.cesnet.cz


I've discovered that if I reboot either node, whether with the 'reboot' command from a terminal or from the VMware web interface, everything works fine: the rebooting node leaves the cluster and rejoins afterwards.

The problem appears when I shut the machine down from the VMware web interface (a guest OS shutdown, not a forced power-off) and start it again. The machine is then unable to join the cluster: Pacemaker and Corosync are not running. Pacemaker reports that it failed on a dependency, which is obviously corosync.
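For completeness, the Pacemaker side of the failure can be inspected with the usual systemd tools; I'm omitting that output here since the interesting part is corosync itself:

root@r1nren:~# systemctl status pacemaker.service
root@r1nren:~# journalctl -u pacemaker -b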

Corosync says:

root@r1nren:~# crm status
ERROR: status: crm_mon (rc=107): Connection to cluster failed: Transport endpoint is not connected

root@r1nren:~# service corosync status
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: failed (Result: signal) since Tue 2017-10-10 10:27:10 CEST; 1min 10s ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
Process: 709 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=killed, signal=ABRT)
 Main PID: 709 (code=killed, signal=ABRT)

Oct 10 10:27:05 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:06 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:07 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:08 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:09 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:10 r1nren.et.cesnet.cz corosync[709]: corosync: votequorum.c:2065: message_handler_req_exec_votequorum_nodeinfo: Assertion `sender_node != NULL' failed.
Oct 10 10:27:10 r1nren.et.cesnet.cz systemd[1]: corosync.service: Main process exited, code=killed, status=6/ABRT
Oct 10 10:27:10 r1nren.et.cesnet.cz systemd[1]: Failed to start Corosync Cluster Engine.
Oct 10 10:27:10 r1nren.et.cesnet.cz systemd[1]: corosync.service: Unit entered failed state.
Oct 10 10:27:10 r1nren.et.cesnet.cz systemd[1]: corosync.service: Failed with result 'signal'.

root@r1nren:~# journalctl -u corosync
Oct 10 10:26:58 r1nren.et.cesnet.cz systemd[1]: Starting Corosync Cluster Engine...
Oct 10 10:26:58 r1nren.et.cesnet.cz corosync[709]: [MAIN ] Corosync Cluster Engine ('2.4.2'): started and ready to provide service.
Oct 10 10:26:58 r1nren.et.cesnet.cz corosync[709]: [MAIN ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snm
Oct 10 10:26:58 r1nren.et.cesnet.cz corosync[709]: [TOTEM ] Initializing transport (UDP/IP Unicast).
Oct 10 10:26:58 r1nren.et.cesnet.cz corosync[709]: [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha256
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [TOTEM ] The network interface is down.
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [SERV ] Service engine loaded: corosync configuration map access [0]
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [QB ] server name: cmap
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [SERV ] Service engine loaded: corosync configuration service [1]
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [QB ] server name: cfg
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [QB ] server name: cpg
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [SERV ] Service engine loaded: corosync profile loading service [4]
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [SERV ] Service engine loaded: corosync resource monitoring service [6]
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [WD ] No Watchdog /dev/watchdog, try modprobe <a watchdog>
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [WD ] resource load_15min missing a recovery key.
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [WD ] resource memory_used missing a recovery key.
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [WD ] no resources configured.
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [SERV ] Service engine loaded: corosync watchdog service [7]
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [QUORUM] Using quorum provider corosync_votequorum
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [QB ] server name: votequorum
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [QB ] server name: quorum
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [TOTEM ] adding new UDPU member {78.128.211.51}
Oct 10 10:26:59 r1nren.et.cesnet.cz corosync[709]: [TOTEM ] adding new UDPU member {78.128.211.52}
Oct 10 10:27:00 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-16)
Oct 10 10:27:01 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-16)
Oct 10 10:27:02 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-16)
Oct 10 10:27:03 r1nren.et.cesnet.cz corosync[709]: [TOTEM ] The network interface [78.128.211.51] is now up.
Oct 10 10:27:03 r1nren.et.cesnet.cz corosync[709]: [TOTEM ] adding new UDPU member {78.128.211.51}
Oct 10 10:27:03 r1nren.et.cesnet.cz corosync[709]: [TOTEM ] adding new UDPU member {78.128.211.52}
Oct 10 10:27:03 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:04 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:05 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:06 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:07 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:08 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:09 r1nren.et.cesnet.cz corosync[709]: [QB ] Denied connection, is not ready (709-1337-18)
Oct 10 10:27:10 r1nren.et.cesnet.cz corosync[709]: corosync: votequorum.c:2065: message_handler_req_exec_votequorum_nodeinfo: Assertion `sender_node != NULL' failed.
Oct 10 10:27:10 r1nren.et.cesnet.cz systemd[1]: corosync.service: Main process exited, code=killed, status=6/ABRT
Oct 10 10:27:10 r1nren.et.cesnet.cz systemd[1]: Failed to start Corosync Cluster Engine.
Oct 10 10:27:10 r1nren.et.cesnet.cz systemd[1]: corosync.service: Unit entered failed state.
Oct 10 10:27:10 r1nren.et.cesnet.cz systemd[1]: corosync.service: Failed with result 'signal'.
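Reading the journal, corosync starts while the interface still has no address ("The network interface is down" at 10:26:59), and the assertion fires a few seconds after the interface comes up at 10:27:03. If that ordering is the trigger, I could imagine working around it by making corosync wait for the network; a sketch of a systemd drop-in (untested, and I'm not sure it is the right fix):

root@r1nren:~# cat /etc/systemd/system/corosync.service.d/wait-for-network.conf
# Untested sketch: delay corosync start until the network is reported online.
[Unit]
Wants=network-online.target
After=network-online.target

root@r1nren:~# systemctl daemon-reload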

Corosync configuration:
root@r1nren:~# cat /etc/corosync/corosync.conf
totem {
        version: 2
        transport: udpu
        cluster_name: eduroam.cz
        token: 3000
        token_retransmits_before_loss_const: 10
        clear_node_high_bit: yes
        crypto_cipher: aes256
        crypto_hash: sha256
        interface {
                ringnumber: 0
                bindnetaddr: 78.128.211.51
                ttl: 1
        }
}

logging {
        fileline: off
        to_stderr: no
        to_logfile: no
        to_syslog: yes
        syslog_facility: daemon
        debug: off
        timestamp: on
        logger_subsys {
                subsys: QUORUM
                debug: off
        }
}

quorum {
        provider: corosync_votequorum
        expected_votes: 2
        two_node: 1
}

nodelist {
        node {
                ring0_addr: 78.128.211.51
        }
        node {
                ring0_addr: 78.128.211.52
        }
}
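One thing I notice while pasting this: the nodelist entries carry no explicit nodeid, so corosync has to derive the node IDs from the ring0 addresses. I don't know whether that is related to the votequorum assertion, but if explicit IDs were worth trying, I assume the nodelist would look like this (untested):

nodelist {
        node {
                ring0_addr: 78.128.211.51
                nodeid: 1
        }
        node {
                ring0_addr: 78.128.211.52
                nodeid: 2
        }
}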


Let me know if I can provide any more information about this (are there any further corosync logs I could collect?).
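If a log file would help, I assume I could enable one by extending the logging section like this (to_logfile and logfile per the corosync.conf man page, assuming the target directory exists) and restarting corosync:

logging {
        # existing settings as above, plus:
        to_logfile: yes
        logfile: /var/log/corosync/corosync.log
}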

View from r2:
root@r2nren:~# crm status
Stack: corosync
Current DC: r2nren.et.cesnet.cz (version 1.1.16-94ff4df) - partition with quorum
Last updated: Tue Oct 10 10:29:45 2017
Last change: Tue Oct 10 10:25:32 2017 by root via crm_attribute on r1nren.et.cesnet.cz

2 nodes configured
8 resources configured

Online: [ r2nren.et.cesnet.cz ]
OFFLINE: [ r1nren.et.cesnet.cz ]

Full list of resources:

 Clone Set: clone_ping_gw [ping_gw]
     Started: [ r2nren.et.cesnet.cz ]
     Stopped: [ r1nren.et.cesnet.cz ]
 Resource Group: group_eduroam.cz
     standby_ip (ocf::heartbeat:IPaddr2):       Started r2nren.et.cesnet.cz
     offline_file       (systemd:offline_file): Started r2nren.et.cesnet.cz
     racoon     (systemd:racoon):       Started r2nren.et.cesnet.cz
     radiator   (systemd:radiator):     Started r2nren.et.cesnet.cz
     eduroam_ping       (systemd:eduroam_ping): Started r2nren.et.cesnet.cz
     mailto     (ocf::heartbeat:MailTo):        Started r2nren.et.cesnet.cz


What could be causing the problem I've encountered?

Thanks for any help.

Regards,
Vaclav

--
Václav Mach
CESNET, z.s.p.o.
www.cesnet.cz
